Question

How does Burrows-Wheeler Transformation work on genetic sequences?

6.1 years ago

gregoryhuang2005 ▴ 10

Hi -

I'm wondering how exactly the BWT creates "runs" of the same character in the last column after sorting the rotations of the string. Is it because of some conditional frequency of nucleotides that I'm not aware of? I understand that in English a character is often more likely to appear after certain other characters (that's how the BWT's effectiveness is explained in every text I've found so far), but does this also apply to nucleotides?

Burrows-Wheeler Transform Sequences Compression • 1.5k views

score 4 · Accepted Answer · 2018-03-27

6.1 years ago

kloetzl ★ 1.1k

You are right: the BWT will have a hard time compressing random sequences of nucleotides, since (uniformly) random data is by definition hard to compress. However, genetic sequences are far from random. Just think of GC bias, codon bias, motifs, patterns in various forms (promoters, TATA box), duplications, … All of these reduce the "randomness" (entropy) of the data and increase its repetitiveness, which the BWT can then exploit. Whenever a substring recurs, the rotations that start with it sort next to each other, and the characters preceding them (which end up in the last column) tend to be identical; that is where the runs come from.
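
To make this concrete, here is a minimal sketch in Python, not a production implementation. It uses the naive "sort all rotations" construction (real tools build the BWT via suffix arrays), and the sequences are synthetic assumptions: one uniformly random string and one made by repeating a random 100 bp unit 20 times as a crude stand-in for duplications. The helper names `bwt` and `count_runs` are made up for this illustration.

```python
import random


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, return the last column.

    Assumes the sentinel does not occur in text and sorts before every
    other character (true for "$" versus A/C/G/T and ASCII letters).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Count maximal runs of identical characters (fewer runs = more compressible)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


rng = random.Random(0)  # fixed seed so the comparison is reproducible

# Uniformly random nucleotides: little structure for the BWT to exploit.
random_seq = "".join(rng.choice("ACGT") for _ in range(2000))

# Same length, but built from one repeated 100 bp unit (toy "duplications").
unit = "".join(rng.choice("ACGT") for _ in range(100))
repetitive_seq = unit * 20

runs_random = count_runs(bwt(random_seq))
runs_repetitive = count_runs(bwt(repetitive_seq))
print("runs in BWT of random sequence:    ", runs_random)
print("runs in BWT of repetitive sequence:", runs_repetitive)
```

The repetitive sequence should give far fewer runs than the random one, even though both have the same length and alphabet. Real genomes sit somewhere in between; how much you gain depends on how repetitive the particular sequence is.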