Question

massive pairwise sequence alignment

1

Entering edit mode

6.3 years ago

mkh ▴ 60

Hi all, I have two sets of data (let's say A & B ) each includes 1000,000 sequences. I would like to see how many of sequences of one file (e.g. A) has at lease 90% matches to the sequences in the other file (e.g. B). I have already wrote a script which take one sequence from A and do pairwise alignment to all sequences in B and store the score for each matches. (I have used Biopython module) My problem is, the script is very slow and still it did not finish after 4 days run on 21 cores. Do you guys have any idea that how can I improve the script? or is there any better way to do this massive pariwise alignment?

Thanks

alignment sequence • 2.0k views

ADD COMMENT • link updated 6.2 years ago by Joe 21k • written 6.3 years ago by mkh ▴ 60

0

Entering edit mode

Instead of re-inventing the wheel use one of the known MSA programs (clustal, muscle, t-coffee, mafftt). How long are these sequences? A MSA of a million sequences may be tough to view/decipher. Can you de-dupe your data before doing MSA?

ADD REPLY • link 6.3 years ago by GenoMax 142k

0

Entering edit mode

Those softwares are deigned for multiple sequence alignment. right? But I dont want to do MSA. seq size ranges btw 100-300 bp.

ADD REPLY • link 6.3 years ago by mkh ▴ 60

0

Entering edit mode

My bad. Perhaps using an NGS aligner would still be much faster. I suggest looking at bbmap.sh from BBMap suite. Let me try a small test.

ADD REPLY • link 6.3 years ago by GenoMax 142k

0

Entering edit mode

Actually these sequences are not related to RNA-seq. I am trying to find unique sequences among a dataset with a million sequence then create a distance matrix based on that.

ADD REPLY • link 6.3 years ago by mkh ▴ 60

1

Entering edit mode

That is ok. NGS aligners are going to be inherently fast compared to other alignment programs, as long as they can handle fasta format sequences (which bbmap can).

I am trying to find unique sequences among a dataset with a million sequence

This part is very easily done with clumpify.sh. It will be insanely fast. You can allow for errors (subs= option). See this thread for usage information: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files You can use clumpify with fasta sequences. Where ever the commands say fq.gz you can replace those with .fa for your files.

then create a distance matrix based on that

that I am not sure how to do right away.

ADD REPLY • link 6.3 years ago by GenoMax 142k

0

Entering edit mode

I'm also puzzled by the 'distance matric' part. Is it correct to assume you mean 'distance matrix' as in phylogenetic context?

You want to create it based on the unique sequences in your dataset? What do you consider ''unique' , not otherwise present in the dataset or less than 90% identical/similar?

ADD REPLY • link 6.3 years ago by lieven.sterck 15k

score 1 · Answer 1 · 2018-02-20

I am answering my question, since It may help people with similar problem. The fast massive alignment is possible by applying Bowie. When you do alignment using Bowtie2 it will tell you how many sequences are repeated. And you can delete the repetitive sequences by extracting it from sam file (Bowtie output)

score 0 · Answer 2 · 2018-02-06

0

Entering edit mode

6.3 years ago

lieven.sterck 15k

"massive pairwise alignment" ? makes me think of blast ! :-)

I thus suggest to do a blast search and parse the resulting output file.

ADD COMMENT • link 6.3 years ago by lieven.sterck 15k

0

Entering edit mode

Thanks. I have already tried blast and the output is different than pairwise alignment!!

ADD REPLY • link 6.3 years ago by mkh ▴ 60

0

Entering edit mode

Yes it might be, you could benefit from tweaking the blast parameters a little (eg. do ungapped blast).

I do wonder however how different the result are (can be?) between these two approaches, despite that I'm very well aware from the fact blast is not actually an alignment tool. It will nonetheless be able to answer your question if that is to look for >90% identity.

Are we talking DNA or protein here btw?

ADD REPLY • link 6.3 years ago by lieven.sterck 15k

score 0 · Answer 3 · 2018-02-20

0

Entering edit mode

6.2 years ago

Joe 21k

Something else you could do is use the CD-HIT suite to cluster the sequences to a particular stringency.

I don’t know how it will handle that many sequences specifically, but it’s usually quite fast I think.

ADD COMMENT • link 6.2 years ago by Joe 21k