How to filter a sequence database with bash?
3.3 years ago

Hello, I am trying to filter a FASTA sequence database using bash.

>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT
>49689252 Fumaria officinalis
ACGCACCGAGTCGCCCCCACCCGCCCCCCAAGAGGTGCCGCGGGAGGGAGCGGAGAATGGCCCCCCGTGCCCCAGCGCGCGGCCGGCCCAAACACAGGCCCCGGGAGGCCGGCGTCACGAT
...

It's a plant database and I want to filter it with a list of plants:

Abies alba  
Acer campestre  
Achillea millefolium subsp. sudetica
...

This is the result I need:

>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT

I already tried:

grep -Ff list.txt database.txt > filtered.txt

This gave me a list of the ID lines in the database that match the plant list. With the following command I then extracted each of those ID lines together with its sequence:

grep -x -F -A 1 -f 'filtered.txt' 'database.fasta' > filtered_database.fasta

As the database I want to filter is very large and some plants occur multiple times under different tax-IDs (e.g. Acer campestre), I am not sure whether this is the right way and whether I got all the sequences for the list...

Are there any other ways to filter this FASTA database with a list of binomial names in bash?

Thank you very much!

Greetings, Lisa

sequencing database bash FASTA grep • 932 views
3.3 years ago

If you are completely sure that your sequence database consists of pairs of ID + sequence lines, you can combine both of your greps into one. You'll get all the IDs that match your query, even if they are repeated, together with the sequence line that follows each match, so you get the desired result in a single, fast step:

grep --no-group-separator -F -w -A1 -f list.txt database.txt > filtered_database.fasta

I've added the -w option so that your patterns are matched not only as fixed strings (-F) but also as whole words. Also, since grep prints -- lines between groups of matches by default when -A is used, you can suppress them with --no-group-separator.
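One caveat with the grep approach: -w only checks word boundaries, so a name such as Acer campestre would also match a header with a longer name that starts the same way, and a pattern in list.txt with trailing spaces would not match at all. If that matters, a possible alternative is to compare the whole name after the ID. The following is only a sketch: it assumes every header looks like ">ID name", and the sample data, the "subsp. example" entry and the scratch directory are made up for illustration.

```shell
# Scratch directory with tiny, made-up sample files (not the real database).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  '>1 Abies alba' 'ACAAC' \
  '>2 Acer campestre' 'ATCGT' \
  '>3 Acer campestre' 'ATCGA' \
  '>4 Acer campestre subsp. example' 'ATCGG' \
  '>5 Fumaria officinalis' 'ACGCA' > database.fasta

# Note the trailing spaces after "Abies alba"; they are trimmed below.
printf '%s\n' 'Abies alba  ' 'Acer campestre' > list.txt

awk '
  # First file (list.txt): store each name, trimmed of surrounding whitespace.
  NR == FNR { gsub(/^[ \t]+|[ \t]+$/, ""); if ($0 != "") want[$0] = 1; next }
  # Header line: strip ">ID " and keep the record only on an exact name match.
  /^>/ { name = $0; sub(/^>[^ ]+ +/, "", name); gsub(/[ \t]+$/, "", name); keep = (name in want) }
  # Print headers and sequence lines of records that are being kept.
  keep
' list.txt database.fasta > filtered_database.fasta

cat filtered_database.fasta
```

Here both Acer campestre records are kept, while the "subsp. example" record is skipped because its name is not in the list. Since the keep flag stays set until the next header, this also works when sequences are wrapped over several lines.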


Thank you very much! This helps a lot. And yes, the -- lines appeared before, so thank you for your advice :-)

3.3 years ago

Linearize, grep, convert back to FASTA.
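Those three steps could look roughly like this. It is a minimal sketch, not a tested pipeline for the real database: the file names linear.tsv and hits.tsv, the sample records and the scratch directory are made up, and it assumes that neither headers nor sequences contain tab characters.

```shell
# Scratch directory with a tiny, made-up example (not the real database).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The first sequence is deliberately wrapped over two lines.
printf '%s\n' \
  '>1 Abies alba' 'ACAAC' 'ATACG' \
  '>2 Acer campestre' 'ATCGT' \
  '>3 Fumaria officinalis' 'ACGCA' > database.fasta

printf '%s\n' 'Abies alba' 'Acer campestre' > list.txt

# 1) Linearize: one record per line, header and joined sequence separated by a tab.
awk '/^>/ { if (hdr != "") print hdr "\t" seq; hdr = $0; seq = ""; next }
     { seq = seq $0 }
     END { if (hdr != "") print hdr "\t" seq }' database.fasta > linear.tsv

# 2) Grep: keep every record whose line contains one of the names.
grep -F -w -f list.txt linear.tsv > hits.tsv

# 3) Convert back to FASTA: turn the tab back into a newline.
tr '\t' '\n' < hits.tsv > filtered_database.fasta

cat filtered_database.fasta
```

Because each record sits on a single line during the grep step, no -A 1 is needed and the database does not have to consist of strict two-line pairs.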
