Question

How to grep FASTA sequences using a list of IDs

2.9 years ago

Kumar ▴ 170

Hi,

I have a FASTA file of sequences and a second file listing some of its IDs. I am trying to extract the sequences whose IDs appear in that list. I tried the following commands, but none of them produces any output. I would appreciate any suggestions.

Commands I tried:
$ seqtk subseq test.fa test.txt
$ grep -A 1 -wFf list.txt sequences.fas > newfile2.fas
$ for i in $(cut -d" " -f1- file2); do grep -o "$i" file1 | tee -a result.txt; done

Example

File 1:

>SEGI_09259
>SEGI_10011
>SEGI_06629

File 2:
 >SEGI_07257  
    MKICGWLYHFKFSKNMQGKVVLIIGL       
 >SEGI_10011    
    MNNCCFMVMRLGGSRSTGRGLKSSEAGE
  >SEGI_06629    
    MGVGIVKSLAGFMLLLNFCMYMTVAGIAG
    MAVGIVK

Expected output:
>SEGI_10011    
 MNNCCFMVMRLGGSRSTGRGLKSSEAGE
>SEGI_06629    
MGVGIVKSLAGFMLLLNFCMYMTVAGIAG
MAVGIVK

grep FASTA Sequence • 1.4k views

score 0 · Answer 1 · 2021-05-19

2.9 years ago

GenoMax 141k

See: How do I extract Fasta Sequences based on a list of IDs?

My recommendation is to use faSomeRecords from Jim Kent linked in the answer above.

A seqkit-based answer (from "How can I pull out specific protein fastas from one file using information from the protein header?") that will work for any FASTA file:

seqkit -w 0 grep -nr -f ids.txt test.fa
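
If none of the tools above is available, a plain awk fallback is also possible. The sketch below is only an illustration, not taken from either linked thread: the function name extract_fasta_by_id is hypothetical, and it assumes that each line of the ID file holds one ID (with or without a leading ">"), that a record's ID is the first whitespace-delimited word of its header, and that stray indentation or Windows line endings, like the indentation in the example in the question, should be ignored. Unlike grep -A 1, it keeps every sequence line of a matching record, so multi-line records such as SEGI_06629 come out whole.

```shell
# Hypothetical helper (not part of seqtk, seqkit or faSomeRecords).
# Usage: extract_fasta_by_id ids.txt sequences.fa > subset.fa
extract_fasta_by_id() {
    awk '
        # Tolerate Windows (CRLF) line endings in both files.
        { sub(/\r$/, "") }

        # First file (ID list): remember each ID, dropping an optional leading ">".
        NR == FNR {
            id = $1
            sub(/^>/, "", id)
            if (id != "") wanted[id] = 1
            next
        }

        # Second file (FASTA): a header line decides whether the record is kept.
        /^[ \t]*>/ {
            id = $1
            sub(/^>/, "", id)
            keep = (id in wanted)
        }

        # Print header and sequence lines of kept records, trimmed of stray whitespace.
        keep {
            sub(/^[ \t]+/, "")
            sub(/[ \t]+$/, "")
            print
        }
    ' "$1" "$2"
}
```

With the files from the question this would be called as extract_fasta_by_id list.txt sequences.fas > newfile2.fas. Records are written in the order they appear in the FASTA file, not in the order of the ID list.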