Help: do I always have to re-open a sequence file within a for loop in Python?
8.2 years ago
pawlowac ▴ 80

I'm trying to extract nucleotide sequences from a draft genome based on alignment coordinates from a SAM file. Right now, I'm opening the SAM file with the csv module (tab-delimited), matching the sequence name, and finally grabbing the sequence based on the coordinates.

import csv
import re
from Bio import SeqIO

bam_file=open('orfH9_offtarget.sam', 'rb')
bam_reader=csv.reader(bam_file,delimiter='\t')
offtarget_results = open('offtarget_orfH9.txt', 'w')

unique_seq_names=[]
parsed_data=[]
num=[]

for row in bam_reader:
    if row[2]:
        genome_seq=SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
        for contig in genome_seq:
            if row[2] in contig.id:
                if int(row[1]) == 256:
                    i = int(row[3])
                    num.append(contig.seq[i:i+20])
                elif int(row[1]) == 272:
                    i = int(row[3])
                    num.append(contig.seq.reverse_complement()[i:i+20])
                elif int(row[1]) == 0:
                    i = int(row[3])
                    num.append(contig.seq[i:i+20])
                elif int(row[1]) == 16:
                    i = int(row[3])
                    num.append(contig.seq.reverse_complement()[i:i+20])

print("length %s" % len(num))

row[1] holds the SAM flag that dictates whether I pull the reverse complement: 0 and 16 mean an exact match, while the other two flags mean a non-exact match. I'm only appending to a list right now to make sure that I successfully get each sequence; it will write to a file once it works.

This code works exactly as intended, but it's SLOW. I suspect the problem is re-opening the draft genome sequence file for every row, but I can't get it to work if I open it outside of the for loop.

Any suggestions? (Keeping in mind that I am a wet-lab scientist that knows a little python)

SAM python bowtie2 biopython • 3.2k views

OK, I know close to nothing about Python, but I understand you are re-opening the same genome file over and over, right?

genome_seq=SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')

If this is the case, of course you can move reading the genome to outside the loop. Then you will read it only once and keep it in memory.
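A minimal sketch of that restructuring with Biopython's SeqIO.to_dict (a tiny in-memory FASTA stands in for the question's genome.fasta, and one hard-coded row stands in for the SAM loop):

```python
from io import StringIO
from Bio import SeqIO

# In the real script this would be SeqIO.parse('genome.fasta', 'fasta');
# a tiny in-memory FASTA keeps the sketch self-contained.
fasta_handle = StringIO('>contig1\nACGTACGTACGTACGTACGTACGT\n>contig2\nTTTTAAAACCCC\n')

# Read the genome a single time into a dict keyed by record id...
genome = SeqIO.to_dict(SeqIO.parse(fasta_handle, 'fasta'))

# ...then, inside the SAM loop, look each contig up by name instead of
# re-parsing the whole FASTA for every row.
row = ['read1', '0', 'contig1', '3']   # stand-in for one parsed SAM row
contig = genome.get(row[2])
if contig is not None:
    pos = int(row[3])
    fragment = contig.seq[pos:pos + 20]
```

Dictionary lookup is constant-time, so the per-row cost no longer scales with the size of the genome file.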


I think reading the whole reference sequence and keeping it in memory is the way to go if you want to query many random positions. Even if the fasta file is indexed, querying millions of positions will take much, much longer than accessing a genome stored in memory.

This simple function will read the reference fasta and make a dictionary of it (it is not as tested and robust as Biopython, but you drop a dependency):

def fastaToDict(faName):
    """Read the fasta file faName into a dict of {seqname: sequence}."""
    fasta = open(faName)
    genome = {}
    for line in fasta:
        line = line.strip()
        if line.startswith('>'):
            chrom = line.lstrip('>').split()[0]
            genome[chrom] = []
            continue
        genome[chrom].append(line)
    fasta.close()
    for chrom in genome:
        genome[chrom] = ''.join(genome[chrom])
    return genome
8.2 years ago

I can see that you're still developing this script, so I'll point out some code that might help you along the way. This snippet uses two Python modules I maintain:

Your current code is slow because each time you open the FASTA file you read through it from top to bottom to find the sequence you want. This is unnecessary, since FASTA files can be indexed and randomly accessed using pyfaidx. The simplesam module is an alternative to pysam that doesn't require a compiler (it's pure Python), but you can go ahead and parse the SAM format yourself if you are doing this as a learning exercise.

8.2 years ago
James Ashmore ★ 3.4k

A non-python solution would be to use bedtools:

# Convert the SAM to BAM, then create a BED file of the alignment coordinates
# (the strand column in the BED file is taken from the alignment flag)
samtools view -b orfH9_offtarget.sam | bedtools bamtobed -i stdin > orfH9_offtarget.bed
# FASTA file of the (strand-aware) sequences corresponding to those coordinates
bedtools getfasta -s -fi genome.fasta -bed orfH9_offtarget.bed > orfH9_offtarget.fasta

It's a good idea, but as I said in my comment above, if you have a BAM file of millions of reads, bedtools getfasta will take ages.

8.1 years ago
Peter 6.0k

Rather than using SeqIO.parse to loop over the file every time, use SeqIO.index or SeqIO.index_db, which can jump directly to the FASTA entry of interest and parse it on demand. There are examples in the Biopython Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
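For instance, a minimal sketch (a throwaway temp file stands in for the question's genome.fasta):

```python
import os
import tempfile
from Bio import SeqIO

# A tiny stand-in for genome.fasta; SeqIO.index wants a filename.
fh = tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False)
fh.write('>contig1\nACGTACGTACGT\n>contig2\nTTTTAAAA\n')
fh.close()

# index() scans the file once and records file offsets; each lookup then
# seeks straight to the right entry instead of re-parsing from the top.
genome = SeqIO.index(fh.name, 'fasta')
record = genome['contig2']
fragment = record.seq[0:4]

genome.close()
os.unlink(fh.name)
```

SeqIO.index_db works the same way but stores the offsets in an SQLite file, so the scan itself only happens once across runs.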


I don't know about SeqIO.index, but if it uses the index to query the reference fasta on disk, it will take a long time to query millions of positions. Better to read the reference into memory, if possible.

(This is essentially the same comment I posted above.)

8.2 years ago
John 13k

Personally, I think it would be better to reduce the size/complexity of the input data first, by extracting from the SAM file all the data you need (and only the data you need), and then, once that's done, using it to get the DNA sequences:

import csv
import re
from Bio import SeqIO
import string
import collections

bam_file=open('orfH9_offtarget.sam', 'rb')
bam_reader=csv.reader(bam_file,delimiter='\t')
offtarget_results = open('offtarget_orfH9.txt', 'w')

contigs = set()
genome_seq = SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
for contig in genome_seq: contigs.add(contig.id)   # store ids, so row[2] lookups work

forward = collections.defaultdict(list)
reverse = collections.defaultdict(list)

revcomp = string.maketrans('ACGT','TGCA')
def rc(DNA): return DNA.translate(revcomp)[::-1]

for row in bam_reader:
  if row[2] in contigs:
    if row[1] in ('256','0'):            # no need to int(row[1]) :)
      forward[row[2]].append(int(row[3]))   # positions do need to be ints for slicing
    else:
      reverse[row[2]].append(int(row[3]))

num = []
genome_seq = SeqIO.parse(open('genome.fasta', 'rU'), 'fasta')
for contig in genome_seq:
  seq = str(contig.seq)                  # plain string, so rc()'s translate works
  for x in forward[contig.id]: num.append(seq[x:x+20])
  for x in reverse[contig.id]: num.append(rc(seq[x:x+20]))

print("length %s" % len(num))

Something like that. If I could have made some assumptions about your data (say, that your SAM file is sorted by contig and position), then I would have read from the SAM file only until we come across a read in a different contig, processed the DNA for the previous contig first, and then moved on with the new data. This would reduce the RAM used, but not by a whole lot.

Furthermore, if you have multiple reads starting at the same place, then I would have used a dict of {position: count} rather than a list of positions (with repetition).
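That last idea can be sketched with the stdlib's collections.Counter (the positions and contig sequence here are made up, just for illustration):

```python
import collections

# Positions with repetition, as the list-based version would store them...
positions = [10, 10, 10, 42, 97, 42]

# ...versus a Counter, which keeps one entry per distinct start position.
counts = collections.Counter(positions)

# Each distinct position is sliced out of the contig once; the count tells
# you how many reads started there, if you need to emit duplicates.
seq = 'A' * 120
fragments = {pos: seq[pos:pos + 20] for pos in counts}
```

For deep coverage this turns millions of repeated slices into one slice per distinct position.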

