Suffix Array

bioinformatics
data structure
Published

November 6, 2024

db = "GCATCGC"

Suffix Array

One motivation behind suffix array is that any position in our database that is a match to any substring of our query sequence can be thought of as the beginning of a suffix.

image.png

To illustrate:
If we wanted to query the sequence C in database of GCATCGC then there would be 3 matching suffixes.
1. One suffix starts from the first C character at index 1.
2. The second suffix starts from the second C character at index 4.
3. The third suffix starts from the third C character at index 6.

To be able to do this, we first need to create an array of suffixes starting at each character of our database, then sort the array of suffixes alphabetically.

suffixes = {i:db[i:] for i,c in enumerate(db)}
print('Suffixes:\t\t', suffixes)
Suffixes:        {0: 'GCATCGC', 1: 'CATCGC', 2: 'ATCGC', 3: 'TCGC', 4: 'CGC', 5: 'GC', 6: 'C'}
sorted_suffixes = {k: v for k, v in sorted(suffixes.items(), key=lambda item: item[1])}
sorted_suffix_keys = list(sorted_suffixes.keys())
print('Sorted Suffixes:\t', sorted_suffixes)
print('Sorted Suffix Keys:\t', sorted_suffix_keys)
Sorted Suffixes:     {2: 'ATCGC', 6: 'C', 1: 'CATCGC', 4: 'CGC', 5: 'GC', 0: 'GCATCGC', 3: 'TCGC'}
Sorted Suffix Keys:  [2, 6, 1, 4, 5, 0, 3]