Question

how to filter sequence database with bash?

3.3 years ago

Lisa Prudnikow • 0

Hello, I am trying to filter a FASTA sequence database using bash.

>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT
>49689252 Fumaria officinalis
ACGCACCGAGTCGCCCCCACCCGCCCCCCAAGAGGTGCCGCGGGAGGGAGCGGAGAATGGCCCCCCGTGCCCCAGCGCGCGGCCGGCCCAAACACAGGCCCCGGGAGGCCGGCGTCACGAT
...

It's a plant database and I want to filter it with a list of plants:

Abies alba  
Acer campestre  
Achillea millefolium subsp. sudetica
...

This is the result I need:

>122051856 Abies alba
ACAACATACGCCCTCCCTGCAAGTTTAGAGGGGAGGAGCGGACATGGTCGTCCGTGCCCATCGTGGTGCGGTTGGCTGAAATGCATTTGATGTCCCCTGCCTTGCATCGGTCAGCGGTGGCCTT
>638110106 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAG
>638110119 Acer campestre
ATCGTTGCCCCCCCCTCCGAAACCCCTCTCCTCTCCTCTCGAAAGAGAGAGACGGATGGGACTTGGGTGTGGGCGGATATTGGCCTCCCGTGGGCCGAACGGCTCGCGGTTGGCCTAAATTTGAGT
>49823319 Achillea millefolium subsp. sudetica
ATCGCGTCGCCCCCAACAAATATCTGTTGGGGGCGGATATTGGTCTCCCGTGCTCATGGTGTGGTTGGCCAAAATAAGAGTCCCTTCGATGGACACACGAACTAGTGGTGGTCGTAAAAACCCT

I already tried:

grep -Ff list.txt database.txt > filtered.txt

This gave me a list of the ID lines from the database that match the list of plants. With the following command I then added the matching sequences to the result:

grep -x -F -A 1 -f 'filtered.txt' 'database.fasta' > filtered_database.fasta

As the database I want to filter is very large and some plants occur multiple times because of their different tax IDs (e.g. Acer campestre), I am not sure whether this is the right way or whether I got all the sequences for the plants on the list...

Are there any other ways to filter this FASTA database in bash using a list of binomial names?

Thank you very much!

Greetings, Lisa

sequencing database bash FASTA grep • 932 views

updated 3.3 years ago by Jorge Amigo • written 3.3 years ago by Lisa Prudnikow

score 1 · Answer 1 · 2021-01-04

If you are completely sure that your sequence database consists of pairs of ID and sequence lines (one header line followed by exactly one sequence line), you could combine both of your greps into one. You'll get every ID line that matches your query, even when a name is repeated, together with the sequence line that follows each match, so you get the desired result in a single, fast step:

grep --no-group-separator -F -w -A1 -f list.txt database.txt > filtered_database.fasta

I've added the -w option so that your patterns are matched not only as fixed strings (-F) but also as whole words. Also, since grep prints -- lines by default to separate groups of matches when -A is used, you can suppress them with --no-group-separator.
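
To address the worry about whether every plant on the list was actually found, here is a minimal sketch that runs the command above on toy data and then reports which names had no hit. The toy records, the extra name "Quercus robur", and the file names found.txt and missing.txt are made up for illustration. It assumes GNU grep, headers of the form ">ID name", and a list.txt without trailing whitespace or Windows line endings (either would make the fixed-string match fail silently).

```shell
#!/usr/bin/env bash
# Toy data in a scratch directory (hypothetical records, not the real database)
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  '>1 Abies alba' 'ACAAC' \
  '>2 Acer campestre' 'ATCGT' \
  '>3 Acer campestre' 'ATCGTT' \
  '>4 Fumaria officinalis' 'ACGCA' > database.fasta
printf '%s\n' 'Abies alba' 'Acer campestre' 'Quercus robur' > list.txt

# The single-step filter from the answer above
grep --no-group-separator -F -w -A1 -f list.txt database.fasta > filtered_database.fasta

# How many records were kept per name (drops the ">ID " prefix first)
grep '^>' filtered_database.fasta | cut -d' ' -f2- | sort | uniq -c

# Names from the list that did not match any header
grep '^>' filtered_database.fasta | cut -d' ' -f2- | sort -u > found.txt
sort -u list.txt | comm -23 - found.txt > missing.txt
cat missing.txt
```

Note that -w only requires word boundaries around the pattern, so "Acer campestre" would also pull in a header such as "Acer campestre subsp. ..." if the database contains one. Such names would then also show up in missing.txt comparisons as extra entries in found.txt, which is one way to spot them.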

score 0 · Answer 2 · 2021-01-04

3.3 years ago

Pierre Lindenbaum 161k

Linearize the FASTA, grep, then convert back to FASTA.
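
One way to read those three steps, as a rough sketch rather than the answerer's exact commands: join each record onto a single tab-separated line with awk, filter those lines with grep, then turn the tab back into a newline. The toy records and the file names linear.tsv and filtered.fasta are assumptions for illustration. Unlike -A1, this also works when sequences are wrapped over several lines.

```shell
#!/usr/bin/env bash
# Toy multi-line FASTA in a scratch directory (hypothetical records)
demo=$(mktemp -d)
cd "$demo"
printf '%s\n' \
  '>1 Abies alba' 'ACAAC' 'ATACG' \
  '>2 Acer campestre' 'ATCGT' \
  '>3 Fumaria officinalis' 'ACGCA' 'CCGAG' > database.fasta
printf '%s\n' 'Abies alba' 'Acer campestre' > list.txt

# 1. Linearize: one record per line, "header<TAB>sequence"
awk '/^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
     { printf "%s", $0 }
     END { printf "\n" }' database.fasta > linear.tsv

# 2. Grep the one-line records, 3. convert the tab back to a newline
grep -F -w -f list.txt linear.tsv | tr '\t' '\n' > filtered.fasta
cat filtered.fasta
```

Because grep now sees header and sequence on the same line, a pattern could in principle match inside the sequence. Species names with spaces and lowercase letters will not occur in nucleotide strings, so that is not a concern for this list. The converted output has each sequence on one line.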
