= "GCATCGC" db
Suffix Array
Suffix Array
One motivation behind suffix array is that any position in our database that is a match to any substring of our query sequence can be thought of as the beginning of a suffix.
To illustrate:
If we wanted to query the sequence C
in database of GCATCGC
then there would be 3 matching suffixes.
1. One suffix starts from the first C
character at index 1.
2. The second suffix starts from the second C
character at index 4.
3. The third suffix starts from the third C
character at index 6.
To be able to do this, we first need to create an array of suffixes starting at each character of our database, then sort the array of suffixes alphabetically.
= {i:db[i:] for i,c in enumerate(db)}
suffixes print('Suffixes:\t\t', suffixes)
Suffixes: {0: 'GCATCGC', 1: 'CATCGC', 2: 'ATCGC', 3: 'TCGC', 4: 'CGC', 5: 'GC', 6: 'C'}
= {k: v for k, v in sorted(suffixes.items(), key=lambda item: item[1])}
sorted_suffixes = list(sorted_suffixes.keys())
sorted_suffix_keys print('Sorted Suffixes:\t', sorted_suffixes)
print('Sorted Suffix Keys:\t', sorted_suffix_keys)
Sorted Suffixes: {2: 'ATCGC', 6: 'C', 1: 'CATCGC', 4: 'CGC', 5: 'GC', 0: 'GCATCGC', 3: 'TCGC'}
Sorted Suffix Keys: [2, 6, 1, 4, 5, 0, 3]
Suffix Array Search
We can do pattern matching on suffix array using binary search. Below we tried to search for suffixes matching the query GC
.
= 'GC'
query = 0
low = len(suffixes) - 1
high = False
found
while high > low and not found:
= (low + high) // 2
mid
= sorted_suffix_keys[mid]
mid_key = sorted_suffix_keys[low]
low_key = sorted_suffix_keys[high]
high_key
if query == suffixes[mid_key][:len(query)]:
= True
found while query not in suffixes[low_key][:len(query)] and suffixes[low_key][:len(query)] < query and low < mid:
+= 1
low while query not in suffixes[high_key][:len(query)] and suffixes[high_key][:len(query)] > query and high > mid:
-= 1
high
elif suffixes[mid_key][:len(query)] > query:
= mid - 1
high if high < 0:
print('out of bound, not found')
elif suffixes[mid_key][:len(query)] < query:
= mid + 1
low if low >= len(suffixes):
print('out of bound, not found')
print('-'*100)
print('Query:\n', query)
print('Original database:\n', db)
print('Sorted suffix array:\n', sorted_suffixes, '\n')
if found:
= sorted_suffix_keys[low]
low_key = sorted_suffix_keys[high]
high_key print(f'Query "{query}" has a match at positions in the sorted suffix array:\t[{low_key}: {high_key}]')
else:
print(f'No match found for query {query}.')
print('-'*100)
----------------------------------------------------------------------------------------------------
Query:
GC
Original database:
GCATCGC
Sorted suffix array:
{2: 'ATCGC', 6: 'C', 1: 'CATCGC', 4: 'CGC', 5: 'GC', 0: 'GCATCGC', 3: 'TCGC'}
Query "GC" has a match at positions in the sorted suffix array: [5: 0]
----------------------------------------------------------------------------------------------------
Let’s make this as a function and do searching on some more queries.
def get_suffixes(db:str) -> tuple:
"""Create suffix array and sort it in alphabetical order by the suffix sequences"""
= {i:db[i:] for i,c in enumerate(db)}
suffixes = {k: v for k, v in sorted(suffixes.items(), key=lambda item: item[1])}
sorted_suffixes return suffixes, sorted_suffixes
def query_suffix_array(query:str, db:str, verbose:bool=False) -> list:
"""Search for query in the preprocessed suffix array using binary search.
Returns a list of matched suffix sequences.
"""
= get_suffixes(db)
suffixes, sorted_suffixes = list(sorted_suffixes.keys())
sorted_suffix_keys
= 0
low = len(suffixes) - 1
high = False
found
while high > low and not found:
= (low + high) // 2
mid
= sorted_suffix_keys[mid]
mid_key = sorted_suffix_keys[low]
low_key = sorted_suffix_keys[high]
high_key if verbose: print('low key', low_key, 'mid key', mid_key, 'high key', high_key)
if query == suffixes[mid_key] or query in suffixes[mid_key][:len(query)]:
= True
found while query not in suffixes[low_key][:len(query)] and suffixes[low_key] < query and low < mid:
+= 1
low = sorted_suffix_keys[low]
low_key
if verbose: print('found bound, new low', sorted_suffix_keys[low])
if query == suffixes[low_key][:len(query)]:
break
while query not in suffixes[high_key][:len(query)] and suffixes[high_key] > query and high > mid:
-= 1
high = sorted_suffix_keys[high]
high_key
if verbose: print('found bound, new high1', sorted_suffix_keys[high])
if query == suffixes[high_key][:len(query)]:
break
while query in suffixes[sorted_suffix_keys[high+1]][:len(query)]:
+= 1
high if verbose: print('found bound, new high2', sorted_suffix_keys[high])
elif suffixes[mid_key] > query:
= mid
high if verbose: print('move high=mid')
if high < 0:
print('out of bound, not found')
elif suffixes[mid_key] < query:
= mid
low if verbose: print('move low=mid')
if low >= len(suffixes):
print('out of bound, not found')
print('-'*100)
print('Query:\n', query)
print('Original database:\n', db)
print('Sorted suffix array:\n', sorted_suffixes, '\n')
if found:
= sorted_suffix_keys[low]
low_key = sorted_suffix_keys[high]
high_key = [low_key, high_key]
result
= sorted_suffix_keys.index(result[0])
start_result_index = sorted_suffix_keys.index(result[1])
end_result_index = sorted_suffix_keys[start_result_index : end_result_index+1]
result_keys = [sorted_suffixes[k] for k in sorted_suffixes if k in result_keys]
matched_result
print('-'*100)
print(f'Query "{query}" has a match at positions in the sorted suffix array:\t[{low_key}: {high_key}]')
print('Matched suffixes:\t', matched_result)
print('-'*100)
return matched_result
print('-'*100)
print(f'No match found for query {query}.')
print('-'*100)
return []
= "ACC"
query query_suffix_array(query, db)
----------------------------------------------------------------------------------------------------
Query:
ACC
Original database:
GCATCGC
Sorted suffix array:
{2: 'ATCGC', 6: 'C', 1: 'CATCGC', 4: 'CGC', 5: 'GC', 0: 'GCATCGC', 3: 'TCGC'}
----------------------------------------------------------------------------------------------------
No match found for query ACC.
----------------------------------------------------------------------------------------------------
[]
= "C"
query query_suffix_array(query, db)
----------------------------------------------------------------------------------------------------
Query:
C
Original database:
GCATCGC
Sorted suffix array:
{2: 'ATCGC', 6: 'C', 1: 'CATCGC', 4: 'CGC', 5: 'GC', 0: 'GCATCGC', 3: 'TCGC'}
----------------------------------------------------------------------------------------------------
Query "C" has a match at positions in the sorted suffix array: [6: 4]
Matched suffixes: ['C', 'CATCGC', 'CGC']
----------------------------------------------------------------------------------------------------
['C', 'CATCGC', 'CGC']
= "AT"
query query_suffix_array(query, db)
----------------------------------------------------------------------------------------------------
Query:
AT
Original database:
GCATCGC
Sorted suffix array:
{2: 'ATCGC', 6: 'C', 1: 'CATCGC', 4: 'CGC', 5: 'GC', 0: 'GCATCGC', 3: 'TCGC'}
----------------------------------------------------------------------------------------------------
Query "AT" has a match at positions in the sorted suffix array: [2: 2]
Matched suffixes: ['ATCGC']
----------------------------------------------------------------------------------------------------
['ATCGC']
References & Credits:
1. Advance Data Structures: Suffix Arrays
1. Advance Data Structures: Suffix Array Search