Question

Add Sequences To A List From A Complex Fasta File In Python

12.1 years ago

hicsuntdrac0nis ▴ 250

I'm trying to organize a FASTA file with multiple sequences. In doing so, I'm trying to add the names to a list and add the sequences to a separate list that is parallel with the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow them to a separate list. I tried appending the lines of sequence to an empty string, but that appended all the lines of all the sequences into a single string.

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    print(data)

How do I add the sequences to a list as a set of strings?

The input file looks like this, with a > before the name on each header line:

>44664.3|G1E3M3IX1IW|Greengenes|2471 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn


>44684.3|G1E3M3B01IW|Greengenes|2688 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn

sequence fasta python list • 12k views

score 9 · Answer 1 · 2012-03-04

So basically you just want to parse a fasta file and put the contents in a header array and a sequence array? You can use BioPython's SeqIO module:

from Bio import SeqIO
import sys

headerList = []
seqList = []

inFile = open(sys.argv[1],'r')
for record in SeqIO.parse(inFile,'fasta'):
   headerList.append(record.id)
   seqList.append(str(record.seq))

If you don't want to use BioPython you can:

import sys

inFile = open(sys.argv[1],'r')

headerList = []
seqList = []
currentSeq = ''
for line in inFile:
   if line[0] == ">":
      headerList.append(line[1:].strip())
      if currentSeq != '':
         seqList.append(currentSeq)

      currentSeq = ''
   else:
      currentSeq += line.strip()

seqList.append(currentSeq)

score 1 · Answer 2 · 2012-03-04

When I wrote a FASTA reshuffling routine some time ago (to lay out contigs as mapped by BLAT), I used Perl and (if memory permits) just threw all the sequences themselves into a hash, with the key being the unique FASTA header/identifier.

To reshuffle and display them in a particular order, I just had to step through an ordered list of FASTA headers and get the corresponding sequences from the hash.

In Python I think you can do the same relatively easily using the Python alternative to hashes: dictionaries (I believe; I'm not a Python expert). Together with Robert's clues you should get there. I usually prefer regular expressions myself, but that's just preference.
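A rough sketch of the dictionary idea, in case it helps. The function name `read_fasta_to_dict` and the tiny example records are made up for illustration, and it assumes every header in the file is unique (a repeated header would overwrite the earlier entry):

```python
def read_fasta_to_dict(lines):
    """Map each FASTA header (without the leading '>') to its sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""  # start a new, empty sequence
        elif header is not None:
            records[header] += line.upper()
    return records


# Hypothetical example data; a real call would pass an open file object.
example = [
    ">seq1 first record\n",
    "gactATAA\n",
    "TTTG\n",
    "\n",
    ">seq2 second record\n",
    "acgtnn\n",
]
records = read_fasta_to_dict(example)

# Fetch sequences in whatever header order is wanted.
for name in ["seq2 second record", "seq1 first record"]:
    print(name, records[name])
```

Because an open file yields lines just like the list above, `read_fasta_to_dict(open(path))` should work the same way on a real file.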

score 1 · Answer 3 · 2012-03-06

I needed to reset the string:

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]
    seq_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                # a new header means the previous sequence is complete
                if data:
                    seq_list.append(data)
                    data=''
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    # the last sequence has no header after it, so add it here
    if data:
        seq_list.append(data)

    print(seq_list)

score 0 · Answer 4 · 2012-03-04

You are going in the right direction! You should add a piece of code that puts your data string into a "sequence_list" when the line starts with a ">". To make sure that no empty data strings are added to your sequence_list, put this in an if statement (if data != ''). After doing this, reset your data string (data = '').

Furthermore, your second for loop (for i in line) is not required. You can just check for the ">" with line[0] == '>'.
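For reference, those steps could look roughly like the sketch below. It reuses the name_list, seq_list and data names from the question; the function name `split_fasta` and the sample lines are only placeholders, and it assumes blank lines should be ignored:

```python
def split_fasta(lines):
    """Return parallel lists of header lines and their sequences."""
    name_list = []
    seq_list = []
    data = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines (also keeps line[0] safe)
        if line[0] == ">":
            name_list.append(line)
            if data != "":
                seq_list.append(data)  # store the finished sequence
            data = ""  # reset for the next record
        else:
            data += line.upper()
    if data != "":
        seq_list.append(data)  # the last record has no header after it
    return name_list, seq_list


# Placeholder input standing in for the lines of a FASTA file.
names, seqs = split_fasta([">a\n", "ac\n", "gt\n", "\n", ">b\n", "tt\n"])
print(names)
print(seqs)
```

Note the extra append after the loop: without it the final sequence in the file would be dropped.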

I hope that helps you to solve the problem yourself.