Question

Extracting FASTA sequences using sequence headers

1

7.0 years ago

a.rex ▴ 350

I usually extract FASTA sequences using samtools.

For example, to extract the sequence for gene 000001:

samtools faidx /path/to/transcriptome 000001

However, I was wondering whether there is a better method for extracting isoform sequences. I tried the following command to extract the sequences for gene isoforms 000001.1, 000001.2 and 000001.3, but to no avail:

samtools faidx /path/to/transcriptome 000001.*

Does anyone have any tips on how to do this effectively?

gene • 1.6k views

updated 7.0 years ago by Jake Warner ▴ 830 • written 7.0 years ago by a.rex ▴ 350

1

BBMap's filterbyname tool will work like this:

filterbyname.sh in=transcriptome.fa out=filtered.fa include names=000001. substring=name

7.0 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2017-05-09

1

7.0 years ago

Jake Warner ▴ 830

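You can do this with awk. The pattern is kept inside the single quotes so the shell cannot expand the wildcard, and it is anchored to the start of the header so that only IDs beginning with 000001. match:

awk '/^>000001\./{flag=1; print; next} /^>/{flag=0} flag' file.fasta >> outfile.fasta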
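
For what it's worth, here is a minimal sketch of the flag idea on a made-up FASTA file, so you can see which records get through. The file names demo.fasta and demo_out.fasta and all of the records are hypothetical, and it assumes the isoform ID sits immediately after the > in each header.

```shell
# Build a tiny demo FASTA (hypothetical records, for illustration only).
cat > demo.fasta <<'EOF'
>000001.1 isoform 1
ACGTACGT
ACGT
>000001.2 isoform 2
GGGGCCCC
>1000001.1 unrelated gene
TTTTTTTT
>000001.3 isoform 3
AAAACCCC
>000002.1 another gene
CCCCGGGG
EOF

# On every header line, decide whether to keep the record that follows;
# the bare "keep" pattern then prints each line while the flag is set.
awk '/^>/ { keep = ($0 ~ /^>000001\./) } keep' demo.fasta > demo_out.fasta

# List the headers that were extracted.
grep '^>' demo_out.fasta
```

Because the pattern is anchored with ^>, the 1000001.1 record is skipped even though its ID contains 000001, and multi-line sequences are kept whole.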
