Add Sequences To A List From A Complex Fasta File In Python
12.1 years ago

I'm trying to organize a FASTA file with multiple sequences. In doing so, I'm trying to add the names to one list and the sequences to a separate list that is parallel with the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow each name as separate entries. I tried appending the lines of sequence to an empty string, but that joined all the lines of all the sequences into a single string.

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    print data

How do I add the sequences to a list as separate strings?

The input file looks like this, with a > before the name on the top line of each record:

>44664.3|G1E3M3IX1IW|Greengenes|2471 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn


>44684.3|G1E3M3B01IW|Greengenes|2688 16S ribosomal RNA [Microbacterium oxydans]
gactATAATTTGTAAATTTCTTGAGATAGAATCATTCGTATTGAATGAGGTCAAATTCTC
TAAACTGATTAAGAAGTATAATACTTAGATGCGAGTTATTGCATCACTTAACGGAGAGTT
TGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGTGAAG
TCTGAATTGAGTACTTCGGTATGATATTTGGGTGGAAAGTGGCGGACGGGTGAGTAACAC
GTGGGTAACCTGCCTCGAAGTGGGGACAACCATTGGAAACGATGGCTAATACCGCATAGT
TCTTTAGATGCATGAGCATTTATAGATAAAACTCTGGTGCTTCGAGAGGGGTCTGCGTCC
GATTAGTTAGTTGGTGGGTAAAGGCCTACCAAGACGATGATCGGTAGCTGGTCTGAGAGG
ACGATCAGTCACACGGGAACTGAGACACGGTCCagtcgtgggagacaaggcacacagggg
ataggnnnnn
12.1 years ago

So basically you just want to parse a FASTA file and put the contents into a header list and a sequence list? You can use Biopython's SeqIO module:

from Bio import SeqIO
import sys

headerList = []
seqList = []

inFile = open(sys.argv[1],'r')
for record in SeqIO.parse(inFile,'fasta'):
   headerList.append(record.id)
   seqList.append(str(record.seq))

If you don't want to use Biopython, you can parse the file by hand:

import sys

inFile = open(sys.argv[1],'r')

headerList = []
seqList = []
currentSeq = ''
for line in inFile:
   if line[0] == ">":
      headerList.append(line[1:].strip())
      if currentSeq != '':
         seqList.append(currentSeq)

      currentSeq = ''
   else:
      currentSeq += line.strip()

seqList.append(currentSeq)
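
One caveat with the loop above: a header that has no sequence lines is added to headerList, but nothing is added to seqList for it, so the two lists can drift out of step. A minimal sketch of one way around that, assuming a hypothetical helper named read_fasta and a small in-memory example in place of a real file, is to yield one (header, sequence) pair per record and build both lists from the pairs:

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                # finish the previous record, even if it had no sequence
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # the last record has no following '>' line
        yield header, "".join(chunks)


# Hypothetical in-memory input standing in for a real file.
example = io.StringIO(">seq1 first\nACGT\nacgt\n\n>empty\n>seq2\nGGCC\n")
records = list(read_fasta(example))
headerList = [header for header, _ in records]
seqList = [seq for _, seq in records]
```

Because every header produces exactly one pair, headerList[i] always belongs to seqList[i], and a record without sequence simply gets an empty string.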
12.1 years ago
ALchEmiXt ★ 1.9k

When I wrote a FASTA reshuffle routine some time ago (to lay out contigs as mapped by BLAT) I used Perl and (if memory permits) just threw all the sequences themselves into a hash, with the key being the unique FASTA header/identifier.

To reshuffle and display them in a particular order, I just had to step through an ordered list of FASTA headers and get the corresponding sequences from the hash.

In Python I think you can do the same relatively easily using the Python alternative to hashes: dictionaries (I believe; I'm not a Python expert). Together with Robert's clues you should get there. I usually prefer regular expressions, but that's just preference.
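
A minimal sketch of that dictionary idea, assuming the headers are unique (a repeated header would overwrite the earlier record). The function name fasta_to_dict, the example records, and wanted_order are all made up for illustration:

```python
def fasta_to_dict(lines):
    """Map each FASTA header (without the '>') to its full sequence."""
    sequences = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:]
            sequences[current] = ""
        elif line and current is not None:
            sequences[current] += line
    return sequences


# Hypothetical records; real lines would come from the FASTA file.
lines = [">contigA", "ACGT", "TTGA", "", ">contigB", "GGCC"]
by_header = fasta_to_dict(lines)

# Pull the sequences out in whatever order is needed.
wanted_order = ["contigB", "contigA"]
reordered = [by_header[name] for name in wanted_order]
```

An open file can be passed in place of the list, since the function only iterates over lines.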

12.1 years ago

I needed to reset the string:

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]
    seq_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                # a new record starts: store the finished sequence and reset
                if data:
                    seq_list.append(data)
                    data=''
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    # the last sequence has no following '>' line, so store it here
    if data:
        seq_list.append(data)

    print(seq_list)
12.1 years ago
Robert Ernst ▴ 60

You are going in the right direction! You should add a piece of code that appends your data string to a "sequence_list" when the line starts with a ">". To make sure that no empty data strings are added to your sequence_list, put this in an if statement (if data != ''). After doing this, reset your data string (data = '').

Furthermore, your second for loop (for i in line) is not required. You can just check for the ">" with line[0] == '>'.

I hope that helps you to solve the problem yourself.
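
Put together, those steps could look roughly like the sketch below. It is only an illustration: the function name split_fasta and the example lines are made up, and it uses line.startswith('>') in place of line[0] == '>' because a stripped blank line is an empty string, for which line[0] would raise an IndexError. The last sequence is stored after the loop, since no ">" line follows it.

```python
def split_fasta(lines):
    """Return parallel lists of header lines and upper-cased sequences."""
    name_list = []
    seq_list = []
    data = ""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name_list.append(line)
            # store the previous sequence, skipping the empty start state
            if data != "":
                seq_list.append(data)
            data = ""  # reset for the next record
        else:
            data += line.upper()
    # the final sequence is not followed by a '>' line
    if data != "":
        seq_list.append(data)
    return name_list, seq_list


# Hypothetical lines standing in for the contents of a FASTA file.
lines = [">a", "gactAT", "ggnn", "", ">b", "ttAA"]
names, seqs = split_fasta(lines)
```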
