Question

Extract Sequences From Fasta File

0

Entering edit mode

10.1 years ago

sbb ▴ 10

I have a fasta file of approx 3 Megabasepairs (Accession no. CM001634.1) and want to extract the sequences of all regions given below and then save the extracted sequences in fasta format in new text file. Also want that new file should be containing accession of file and the name of repeat such as low_complexity, simple repeat.
Kindly suggest me perl script.

SW perc perc perc query position in query matching repeat position in repeat score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID

21 3.6 0.0 0.0 gi|411024077|gb|CM001634.1| 11554 11581 (28623556) + AT_rich Low_complexity 1 28 (0) 1
22 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 12224 12245 (28622892) + AT_rich Low_complexity 1 22 (0) 2
22 8.0 0.0 0.0 gi|411024077|gb|CM001634.1| 12609 12658 (28622479) + AT_rich Low_complexity 1 50 (0) 3
22 5.6 0.0 0.0 gi|411024077|gb|CM001634.1| 12691 12726 (28622411) + AT_rich Low_complexity 1 36 (0) 4
26 3.0 0.0 0.0 gi|411024077|gb|CM001634.1| 13965 13997 (28621140) + AT_rich Low_complexity 1 33 (0) 5
222 3.7 0.0 0.0 gi|411024077|gb|CM001634.1| 16373 16399 (28618738) + (TA)n Simple_repeat 1 27 (0) 6
219 12.5 0.0 0.0 gi|411024077|gb|CM001634.1| 17247 17286 (28617851) + (G)n Simple_repeat 1 40 (0) 7
198 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 20074 20095 (28615042) + (CAAAAA)n Simple_repeat 3 24 (0) 8
189 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 20344 20364 (28614773) + (TA)n Simple_repeat 1 21 (0) 9
23 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 21437 21459 (28613678) + AT_rich Low_complexity 1 23 (0) 10
198 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 22420 22441 (28612696) + (GAA)n Simple_repeat 2 23 (0) 11
189 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 27191 27211 (28607926) + (TTTTTG)n Simple_repeat 4 24 (0) 12
21 0.0 0.0 0.0 gi|411024077|gb|CM001634.1| 27481 27501 (28607636) + AT_rich Low_complexity 1 21 (0) 13
24 16.1 0.0 0.0 gi|411024077|gb|CM001634.1| 33144 33174 (28601963) + AT_rich Low_complexity 1 31 (0) 14
186 4.3 0.0 0.0 gi|411024077|gb|CM001634.1| 34503 34525 (28600612) + (CGAAT)n Simple_repeat 2 24 (0) 15

perl • 2.3k views

ADD COMMENT • link updated 8.1 years ago by Biostar 20 • written 10.1 years ago by sbb ▴ 10

3

Entering edit mode

What have you tried so far? This would seem to be solved simply by iterating over the rows (you can use samtools faidx to retrieve an arbitrary region of an indexed fasta file).

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

Ram · Answer 1 · 2014-03-31

First I would echo what Devon suggested - both stating what you've attempted as well as using samtools faidx. I'm not sure about a Perl solution, but you can do this using my python fasta indexing module pyfaidx:

from pyfaidx import Fasta

with open("regions.txt") as regions, Fasta("sequence.fasta") as fasta:
  for line in regions:
    fields = line.rstrip().split()
    rname, start, end = fields[4:7]
    repeat = ' '.join(fields[9:11])
    seq = fasta[rname][int(start)-1:int(end)-1]
    printseq.name + repeat)
    print(seq.seq)

Which produces:

gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TTTATATATATAATAATAATAGTAATA
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TATATTTATTTATTTTTTATT
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TTTTAAATGTAATAAAAAATTAAAACTACTTTCATTTTTATTATAATAT
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
TATTATCATTTAAATAAATTTACAATATATATATA
gi|411024077|gb|CM001634.1|AT_rich Low_complexity
AATAAAATAATTAAAAATGTAAAAAATAAAAA
gi|411024077|gb|CM001634.1|(TA)n Simple_repeat
TATATATATATATATATATATATACA
gi|411024077|gb|CM001634.1|(G)n Simple_repeat
GGGGGGTAGGGGGGGGGGGGGGGATTGGGGGGGGGGGGG
...

Here is the Gist