Extracting subsequence from FASTA using Samtools faidx not working
1
1
Entering edit mode
8.2 years ago
hmg ▴ 20

Hello,

I have a FASTA file, and I want to extract a subsequence from it.

>first
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>second
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA

I know I can do this with samtools:

samtools faidx seq.fa first:1-5

output is:

>first:1-5
AGCAT

However, my actual sequence names are much more complicated, and anything other than one word doesn't seem to work.

If seq2.fa is:

>first seq
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>second seq 
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA

Then the output of:

samtools faidx seq2.fa first:1-5

>first:1-5
seqAG

​```​ samtools faidx seq2.fa first seq:1-5

first seqAGCATGCATCGTACGATCGTACGATCATCATCGTACGATCGATCGATCGATCGATCG TGTGCTAGACGATCGATCGAT seq:1-5 ```

samtools faidx seq2.fa 'first seq':1-5

>first seq:1-5

Also, if my sequence description contains a number (like seq3.fa):

>seq123
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>seq456
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA
samtools faidx seq3.fa seq123:1-5

>seq123:1-5

Is there a way to use samtools to extract a subsequence when the descriptor is more complicated than just a word? I've tried with samtools 0.1.18 and 0.1.19.

Thanks

faidx fasta samtools • 14k views
ADD COMMENT
4
Entering edit mode
8.2 years ago

faidx stops reading the name after the 1st blank: https://github.com/lh3/samtools/blob/master/faidx.c#L89

while ((ret = razf_read(rz, &c, 1)) != 0 && !isspace(c)) {

you could fix this by converting the spaces to an underline

tr " " "_" < in.fa > out.fa && samtools faidx out.fa
ADD COMMENT
0
Entering edit mode

I tried that, but no luck:

>first_seq
AGCATGCATCGTACGATCGTACGATCATCATC
GTACGATCGATCGATCGATCGATCGTGTGCTA
GACGATCGATCGATCGA
>second_seq 
GTACGATGCATCGATCGATCGATCGATCGATC
GTGATCGTGTCGTCGTCGTCGATAAGCATCGT
AGTACGATCGATCGTA
samtools faidx seq4.fa first_seq:1-5

>first_seq:1-5
ADD REPLY
0
Entering edit mode

Please show me the output of grep first_seq seq4.*

ADD REPLY
0
Entering edit mode

Just:

​>first_seq
ADD REPLY
1
Entering edit mode

If you really ran grep first_seq seq4.*, then you didn't correctly indexed the fasta sequence.

ADD REPLY
0
Entering edit mode

I think the blank lines are the issue, and samtools should have warned when the index creation was attempted. The output of samtools faidx seq4.fa may be informative.

ADD REPLY
0
Entering edit mode

Yep--I didn't index it at all. Silly error by me. Thanks for the help.

ADD REPLY

Login before adding your answer.

Traffic: 1542 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6