from collections import Counter
import pandas as pd
Burrows-Wheeler Transform in Python
Burrows-Wheeler Transform is a method used for pattern matching problem. We can use it to find all instances of substrings in the database that matches our query sequence.
I recently learnt about Burrows-Wheeler Transform from the Advanced Data Structure playlist by Niema Moshiri. In this notebook, I tried to implement it using python.
= 'BANANA$' db
Extracting Burrows-Wheeler Transform (BWT)
= []
rotations
rotations.append(db)= db
s for i in range(len(db)-1):
= s[-1] + s[:-1]
s rotations.append(s)
'rotations':rotations, 'sorted_rotations': sorted(rotations)}) pd.DataFrame({
rotations | sorted_rotations | |
---|---|---|
0 | BANANA$ | $BANANA |
1 | $BANANA | A$BANAN |
2 | A$BANAN | ANA$BAN |
3 | NA$BANA | ANANA$B |
4 | ANA$BAN | BANANA$ |
5 | NANA$BA | NA$BANA |
6 | ANANA$B | NANA$BA |
The Burrows - Wheeler Transform (BWT) is the last characters from the list of sorted rotations.
= [s[-1] for s in sorted(rotations)]
BWT BWT
['A', 'N', 'N', 'B', '$', 'A', 'A']
Inverting the Burrows-Wheeler Transform (BWT)
Burrows-Wheeler transform is reversable. Here’s how to construct the original string from BWT:
Step 1: List out BWT as last column and sorted list of the original database as first column
= pd.DataFrame({'first': sorted(db), 'last': BWT})
df df
first | last | |
---|---|---|
0 | $ | A |
1 | A | N |
2 | A | N |
3 | A | B |
4 | B | $ |
5 | N | A |
6 | N | A |
Step 2: Add the number of times we’ve seen the character at that point as subscript
### adding subscript for first column (sorted original database)
= Counter(db)
dbcounter
= []
sorted_db_subs for c in sorted(db, reverse=True):
+ str(dbcounter[c]))
sorted_db_subs.append(c = dbcounter[c] - 1
dbcounter[c]
= sorted(sorted_db_subs)
sorted_db_subs sorted_db_subs
['$1', 'A1', 'A2', 'A3', 'B1', 'N1', 'N2']
### adding subscript for last column (BWT)
= Counter(BWT)
dbcounter print(dbcounter)
= []
bwt_subs for c in BWT[::-1]:
+ str(dbcounter[c]))
bwt_subs.append(c = dbcounter[c] - 1
dbcounter[c]
= bwt_subs[::-1] bwt_subs
Counter({'A': 3, 'N': 2, 'B': 1, '$': 1})
'first_sub'] = sorted_db_subs
df['last_sub'] = bwt_subs df[
df
first | last | first_sub | last_sub | |
---|---|---|---|---|
0 | $ | A | $1 | A1 |
1 | A | N | A1 | N1 |
2 | A | N | A2 | N2 |
3 | A | B | A3 | B1 |
4 | B | $ | B1 | $1 |
5 | N | A | N1 | A2 |
6 | N | A | N2 | A3 |
Step 3: Add last-to-first mapping
= {k:v for k,v in zip(df['first_sub'].values, range(len(df)))}
MAPPING MAPPING
{'$1': 0, 'A1': 1, 'A2': 2, 'A3': 3, 'B1': 4, 'N1': 5, 'N2': 6}
'L2F'] = df.apply(lambda row: MAPPING[row['last_sub']],axis=1)
df[ df
first | last | first_sub | last_sub | L2F | |
---|---|---|---|---|---|
0 | $ | A | $1 | A1 | 1 |
1 | A | N | A1 | N1 | 5 |
2 | A | N | A2 | N2 | 6 |
3 | A | B | A3 | B1 | 4 |
4 | B | $ | B1 | $1 | 0 |
5 | N | A | N1 | A2 | 2 |
6 | N | A | N2 | A3 | 3 |
We can reconstruct our original database string from this preprocessed data. Here’s how:
We can start with the null termination character we added, we know it is in the first row, first_sub
column, so we retrieve that first.
Remember that as part of our preprocessing steps we do rotations of the original database string, and from that rotations we save the first column in first_sub
and the last column in last_sub
? So this tells us that first_sub
and last_sub
are connected: last_sub
character comes before first_sub
character.
To illustrate:
1. Start with null termination $1
.
2. Find null termination $1
in last_sub
(row 4). The next character follows it is the corresponding character in first_sub
column (red arrow). Now the reconstructed string is $1 B1
.
3. Find the last added character in the last_sub
. Add its corresponding first_sub
to the reconstructed string (blue arrow). Now the reconstructed string is $1 B1 A3
.
4. Continue building the reconstructed string until we reached the character last_sub
of first row. Why first row? because we started the reconstructed string with $1
– which is also the first_sub
character in our preprocessed data’s first row.
= [df.loc[0,'first_sub']]
reconstruct
while reconstruct[-1] != df.loc[0,'last_sub']:
= reconstruct[-1]
last_char = df[df['last_sub']==last_char].index.values[0]
next_index = df.loc[df.index==next_index, 'first_sub'].values[0]
next_char
reconstruct.append(next_char)
= ' '.join(reconstruct)
reconstructed_str_sub = ''.join([c[0] for c in reconstruct])
reconstructed_str
print('Reconstructed database string (with subscript):',reconstructed_str_sub)
print('Reconstructed database string:', reconstructed_str)
print('Original database string:', db)
Reconstructed database string (with subscript): $1 B1 A3 N2 A2 N1 A1
Reconstructed database string: $BANANA
Original database string: BANANA$
We then move the null termination character to the end of the string and we can get back the original database.
Pattern Matching using BWT
Let’s try to do pattern matching using the preprocessed BWT from our toy database.
= 'AN'
query
= range(len(df))
ok_range for c in query[::-1]:
= df.loc[df.index.isin(ok_range) & (df['last'] == c), 'L2F'].head(1).values[0]
top = df.loc[df.index.isin(ok_range) & (df['last'] == c), 'L2F'].tail(1).values[0]
bottom = range(top, bottom+1)
ok_range print('Found character:', c, '\n', df[df.index.isin(ok_range)])
print('-'*40)
# exact match to query sequence is located between the top:bottom position of the preprocessed database
= df.loc[df.index.isin(ok_range), 'first_sub'].values
query_result print(f'Query "{query}" found in database sequence starting from:', query_result)
print('-'*40)
Found character: N
first last first_sub last_sub L2F
5 N A N1 A2 2
6 N A N2 A3 3
----------------------------------------
Found character: A
first last first_sub last_sub L2F
2 A N A2 N2 6
3 A B A3 B1 4
----------------------------------------
Query "AN" found in database sequence starting from: ['A2' 'A3']
----------------------------------------
The query_result
does indeed match.
- $1 B1 A3 N2 A2 N1 A1
- $1 B1 A3 N2 A2 N1 A1
Let’s make this into a function and try another example.
def reconstruct_db(df:pd.DataFrame) -> list:
"""Reconstruct database string using preprocessed data"""
= [df.loc[0,'first_sub']]
reconstruct
while reconstruct[-1] != df.loc[0,'last_sub']:
= reconstruct[-1]
last_char = df[df['last_sub']==last_char].index.values[0]
next_index = df.loc[df.index==next_index, 'first_sub'].values[0]
next_char
reconstruct.append(next_char)return reconstruct
def extract_bwt(db:str) -> list:
"""Extract Burrow-Wheeler Transform from database string"""
= []
rotations
rotations.append(db)= db
s for i in range(len(db)-1):
= s[-1] + s[:-1]
s
rotations.append(s)
= [s[-1] for s in sorted(rotations)]
BWT return BWT
def preprocess_database(db:str) -> pd.DataFrame:
"""Preprocess data from database string"""
= extract_bwt(db)
BWT = pd.DataFrame({'first': sorted(db), 'last': BWT})
df
### adding subscript for first column (sorted original database)
= Counter(db)
dbcounter = []
sorted_db_subs for c in sorted(db, reverse=True):
+ str(dbcounter[c]))
sorted_db_subs.append(c = dbcounter[c] - 1
dbcounter[c] = sorted(sorted_db_subs)
sorted_db_subs
### adding subscript for last column (BWT)
= Counter(BWT)
dbcounter = []
bwt_subs for c in BWT[::-1]:
+ str(dbcounter[c]))
bwt_subs.append(c = dbcounter[c] - 1
dbcounter[c] = bwt_subs[::-1]
bwt_subs
'first_sub'] = sorted_db_subs
df['last_sub'] = bwt_subs
df[
= {k:v for k,v in zip(df['first_sub'].values, range(len(df)))}
MAPPING 'L2F'] = df.apply(lambda row: MAPPING[row['last_sub']],axis=1)
df[return df
def query_bwt(query:str, df:pd.DataFrame) -> list:
"""Perform query matching against preprocessed data using BWT"""
= '\033[1m'
BOLD = '\033[0m'
END_BOLD
= reconstruct_db(df)
reconstruct = range(len(df))
ok_range for c in query[::-1]:
= df.loc[df.index.isin(ok_range) & (df['last'] == c), 'L2F'].head(1)
top_row = df.loc[df.index.isin(ok_range) & (df['last'] == c), 'L2F'].tail(1)
bottom_row if len(top_row) > 0 and len(bottom_row) > 0:
= top_row.values[0]
top = bottom_row.values[0]
bottom else:
print(f'Query "{query}" not found in database.')
return []
= range(top, bottom+1)
ok_range print('Found character:', c, '\n', df[df.index.isin(ok_range)])
print('-'*40)
# exact match to query sequence is located between the top:bottom position of the preprocessed database
= df.loc[df.index.isin(ok_range), 'first_sub'].values
query_result print(f'Query "{query}" found in database sequence starting from:', query_result)
print()
for i,q in enumerate(query_result):
print(i+1, ')\t', *[BOLD+r+END_BOLD if r == q else r for r in recons])
print()
print('-'*40)
return query_result
= 'NA'
query
= preprocess_database("BANANA$")
df2 query_bwt(query, df2)
Found character: A
first last first_sub last_sub L2F
1 A N A1 N1 5
2 A N A2 N2 6
3 A B A3 B1 4
----------------------------------------
Found character: N
first last first_sub last_sub L2F
5 N A N1 A2 2
6 N A N2 A3 3
----------------------------------------
Query "NA" found in database sequence starting from: ['N1' 'N2']
1 ) $1 B1 A3 N2 A2 N1 A1
2 ) $1 B1 A3 N2 A2 N1 A1
----------------------------------------
array(['N1', 'N2'], dtype=object)
= 'NAN'
query
= preprocess_database("BANANA$")
df2 query_bwt(query, df2)
Found character: N
first last first_sub last_sub L2F
5 N A N1 A2 2
6 N A N2 A3 3
----------------------------------------
Found character: A
first last first_sub last_sub L2F
2 A N A2 N2 6
3 A B A3 B1 4
----------------------------------------
Found character: N
first last first_sub last_sub L2F
6 N A N2 A3 3
----------------------------------------
Query "NAN" found in database sequence starting from: ['N2']
1 ) $1 B1 A3 N2 A2 N1 A1
----------------------------------------
array(['N2'], dtype=object)
= 'NANN'
query
= preprocess_database("BANANA$")
df2 query_bwt(query, df2)
Found character: N
first last first_sub last_sub L2F
5 N A N1 A2 2
6 N A N2 A3 3
----------------------------------------
Query "NANN" not found in database.
[]
References & Credits:
1. Advance Data Structures: Burrows-Wheeler Transform (BWT)
1. Advance Data Structures: Inverting the BWT
1. Advance Data Structures: Pattern Matching Using the BWT
1. StackOverflow: How to print bold text in python