Question

Bbmap filterbyname.sh to extract sequences out of a fasta file - wrong output

0

4.6 years ago

marlenejensen ▴ 20

Hi guys,

I wanted to extract specific sequences from a fasta file based on their accession numbers. I used BBMap's filterbyname.sh for this, and at first glance the result looked perfectly fine. I ran this command:

filterbyname.sh in=DB.fasta out=SequencesOfInterest.fasta names=Accessions_DB.txt include=t fixjunk

My names file contains 9499 accessions, but for some reason BBMap gives me 9504 sequences in the output file. I picked random sequences and double-checked them, and so far they were all correct. Is it possible that some accessions appear twice in the output? If so, how can I identify them? Or did I miss an important parameter in my command? I have never used this script before.

Thanks in advance for your help!

Cheers

filterbyname.sh bbmap • 4.0k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 4.6 years ago by marlenejensen ▴ 20

0

Have you checked whether some accessions occur twice in your input/source file? If you post a couple of your fasta headers as examples, we can tell you how to extract and sort them and check whether they are unique.

ADD REPLY • link 4.6 years ago by GenoMax 142k

0

The format of the accessions varies a lot, depending on how the data was generated, e.g.

>Delta1_2004206643 orf2_0004_16 Aldehyde ferredoxin oxidoreductase [O.algarvensis Delta1] # COG COG2414(C)

>Host_265335_c1_seq1_5 [136 - 2] (REVERSE SENSE) len=239 path=[217 0-135 353 136-238]

>Delta3_G2_84585.peg.319 scaffold_0 Molybdopterin biosynthesis protein MoeA / Periplasmic molybdate-binding domain

Does that help?

ADD REPLY • link updated 4.6 years ago by GenoMax 142k • written 4.6 years ago by marlenejensen ▴ 20

1

You would find that many other programs would be defeated by those spaces in the fasta headers. It looks like BBMap did reasonably well. I would still suggest checking whether your input has duplicate fasta headers (even if the actual sequences differ).

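You could try grep "^>" DB.fasta | sort | wc -l and then grep "^>" DB.fasta | sort | uniq | wc -l on the input file and see if those two numbers match; if the second is smaller, some headers are duplicated. This process may need a significant amount of memory on a large file and may crash.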
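To see which headers are duplicated rather than only whether the counts differ, something like the sketch below might help. It assumes the accession is the first whitespace-delimited token after ">", which matches the example headers posted above but may not hold for every file. The function name dup_ids and the tiny demo file are made up for illustration and are not part of BBMap.

```shell
#!/usr/bin/env bash
# Sketch: list FASTA record IDs that occur more than once.
# Assumption: the ID is the first whitespace-delimited token after ">".

dup_ids() {
    # $1 = path to a FASTA file
    grep '^>' "$1" |      # keep header lines only
        sed 's/^>//' |    # drop the leading ">"
        awk '{print $1}' | # keep the first token (the accession)
        sort |
        uniq -d           # print each ID that appears more than once
}

# Tiny hypothetical file, only to demonstrate the function
tmp_fasta=$(mktemp)
printf '>seqA first copy\nACGT\n>seqB only copy\nGGCC\n>seqA second copy\nTTAA\n' > "$tmp_fasta"

dups=$(dup_ids "$tmp_fasta")
echo "$dups"

rm -f "$tmp_fasta"
```

On the real data you would call dup_ids DB.fasta for the source file and dup_ids SequencesOfInterest.fasta for the filtered output. Any ID printed for the output file would be a candidate for the extra records.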