Question

extracting sequence from fast file

1

Entering edit mode

6.0 years ago

blooming.daisy333 ▴ 110

im newbie to linux and NGS. I want to extract specific sequences from genome.fasta file. the IDs which i have are as follows

gnl|BGIA2|CA_chr1:329729-329840
gnl|BGIA2|CA_chr1:359729-379840
.............so on...........

where gnl|BGIA2|CA_chr1 is the name of the chromosome and 329729-329840 is the range of the nucleotides needed to be extracted in separate file alonge with its ID. can someone kindly help me how to do this with samtools or any other tool or command/script.

many thanks

next-gen • 2.2k views

ADD COMMENT • link updated 6.0 years ago by JJ ▴ 680 • written 6.0 years ago by blooming.daisy333 ▴ 110

0

Entering edit mode

Possible duplicate of How To Extract A Sequence From A Big (6Gb) Multifasta File ?

ADD REPLY • link 6.0 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

Hello daisy,

is this connected to your other question?

fin swimmer

ADD REPLY • link 6.0 years ago by finswimmer 16k

0

Entering edit mode

Dear Fin Swimmer, thank you so much for your kind help and precious time. no it is not related to the previous question as that was to extract sequence reads that have sequences spanning specific range. this question is to extract that range from fasta file/genome.fasta..

ADD REPLY • link 6.0 years ago by blooming.daisy333 ▴ 110

score 0 · Answer 1 · 2018-04-26

Here a quick solution with python using pyfaidx module:

import re
from pyfaidx import Fasta


genome = Fasta('genome.fa')

with open('list.txt', "r") as list,  open('output.fa', "w") as output:
    for id in list:
        region = re.split("[:|-]", id.strip())

        seq = genome[region[0]][int(region[1])-1:int(region[2])].seq
        output.write(">"+id.strip()+"\n")
        output.write(seq+"\n")

EDIT:

I always forget bedtools.

So we need again to convert your ID list to a bed file:

 awk -v OFS="\t" '{n=split($0,a,/[:\-]/); print a[1],a[2]-1,a[3],a[0]}' list.txt > ranges.bed

And can than use bedtools getfasta

bedtools getfasta -o output.fa -fi genome.fa -bed ranges.bed

fin swimmer

score 0 · Answer 2 · 2018-04-26

0

Entering edit mode

6.0 years ago

JJ ▴ 680

Check out the gffread utility. gffread can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. So you would only need to convert your id file to a gff.

ADD COMMENT • link 6.0 years ago by JJ ▴ 680