Hi all, I have two sets of data (let's say A & B ) each includes 1000,000 sequences. I would like to see how many of sequences of one file (e.g. A) has at lease 90% matches to the sequences in the other file (e.g. B). I have already wrote a script which take one sequence from A and do pairwise alignment to all sequences in B and store the score for each matches. (I have used Biopython module) My problem is, the script is very slow and still it did not finish after 4 days run on 21 cores. Do you guys have any idea that how can I improve the script? or is there any better way to do this massive pariwise alignment?
Thanks
Instead of re-inventing the wheel use one of the known MSA programs (clustal, muscle, t-coffee, mafftt). How long are these sequences? A MSA of a million sequences may be tough to view/decipher. Can you de-dupe your data before doing MSA?
Those softwares are deigned for multiple sequence alignment. right? But I dont want to do MSA. seq size ranges btw 100-300 bp.
My bad. Perhaps using an NGS aligner would still be much faster. I suggest looking at bbmap.sh from BBMap suite. Let me try a small test.
Actually these sequences are not related to RNA-seq. I am trying to find unique sequences among a dataset with a million sequence then create a distance matrix based on that.
That is ok. NGS aligners are going to be inherently fast compared to other alignment programs, as long as they can handle fasta format sequences (which bbmap can).
This part is very easily done with
clumpify.sh
. It will be insanely fast. You can allow for errors (subs=
option). See this thread for usage information: A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files You can useclumpify
with fasta sequences. Where ever the commands sayfq.gz
you can replace those with.fa
for your files.that I am not sure how to do right away.
I'm also puzzled by the 'distance matric' part. Is it correct to assume you mean 'distance matrix' as in phylogenetic context?
You want to create it based on the unique sequences in your dataset? What do you consider ''unique' , not otherwise present in the dataset or less than 90% identical/similar?