Hi guys,
I wanted to extract specific sequences from a FASTA file based on their accession numbers, so I used filterbyname.sh from BBMap.
At first glance it looked perfectly fine. I ran this command:
filterbyname.sh in=DB.fasta out=SequencesOfInterest.fasta names=Accessions_DB.txt include=t fixjunk
My names file contains 9499 accessions, but for some reason BBMap gives me 9504 sequences in the output file. I picked random sequences and double-checked that it worked properly, and so far it did. But is it possible that some accessions appear twice in the output? If so, how can I identify them? Or did I miss an important parameter in my command? I have never used this script before.
Thanks in advance for your help!
Cheers
Have you checked to see if you have some accessions twice in your input/source file? If you post examples of a couple of your fasta headers we can tell you how to extract/sort and identify if they are unique.
The format of the accessions varies a lot, depending on how the data was generated, e.g.
Does that help?
You would find that many other programs would be defeated by those spaces in the fasta headers. Looks like BBMap did reasonably well. I would still suggest checking if your input had duplicate fasta headers (if not actual sequences).
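You could try
grep "^>" DB.fasta | sort | wc -l
and then
grep "^>" DB.fasta | sort | uniq | wc -l
where DB.fasta is your input file, and see if those numbers match. If the second number is smaller, some headers occur more than once. On a very large file this process may need a significant amount of memory and may crash.
Great, thanks for your advice!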
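To see which headers are duplicated rather than only counting them, a small sketch along the same lines might help. The function name list_dup_headers is made up for illustration and is not part of BBMap. It compares whole header lines, so two records with the same accession but different descriptions would not be reported.

```shell
# list_dup_headers: hypothetical helper, not a BBMap tool.
# Prints every FASTA header line that occurs more than once in the given file.
# Whole header lines are compared, including any text after the accession.
list_dup_headers() {
    # grep keeps only header lines, sort groups identical ones together,
    # and uniq -d prints one copy of each line that is repeated.
    grep "^>" "$1" | sort | uniq -d
}
```

Running list_dup_headers SequencesOfInterest.fasta on the output file, and list_dup_headers DB.fasta on the input, should show whether the five extra sequences come from headers that were already duplicated in the database. Empty output means no header line is repeated.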