Question

biopython script using MUSCLE stalls with >300 sequences

1

Entering edit mode

8.3 years ago

julestrachsel ▴ 20

Hello,

I am trying to learn biopython a little better and I am trying to write a python script that will take in some unaligned DNA sequences in fasta format from the command line, translate these to protein sequences, align these sequences with MUSCLE, and then write out the resulting protein alignment.

I run the script from the command line by typing python3 script.py input.fasta

Here is the script so far:

from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
from sys import argv
from Bio.Align.Applications import MuscleCommandline
from Bio import AlignIO
import subprocess
import sys

def unique_seqs(sequences):
    """returns a list of SeqRecord objects with redundant sequences removed"""
    unique_records = []
    checksum_container = []

    for seq in sequences:
        checksum = seguid(seq.seq)

        if checksum not in checksum_container:
            checksum_container.append(checksum)

            unique_records.append(seq)

    return unique_records

def make_protein_record(nuc_record):
    """Returns a new SeqRecord with the translated sequence (default table)."""
    return SeqRecord(seq=nuc_record.seq.translate(cds=False), id=nuc_record.id, description="translation using def table")

seqs = SeqIO.parse(argv[1], "fasta")
seqs = list(seqs)
prot = []
print("you input {} sequences".format(len(seqs)))
UNIQUE_Seqs = unique_seqs(seqs)
print("there are {} unique DNA sequences".format(len(UNIQUE_Seqs)))

for seq in UNIQUE_Seqs:
    seq = make_protein_record(seq)
    prot.append(seq)

protein_fasta_handle = "prot" + argv[1]

SeqIO.write(prot, protein_fasta_handle, "fasta")

muscle_cline = MuscleCommandline()
child = subprocess.Popen(str(muscle_cline),
                          stdin =subprocess.PIPE,
                          stdout =subprocess.PIPE,
                          stderr =subprocess.PIPE,
                          universal_newlines =True,
                          shell =(sys.platform != "win32"))

SeqIO.write(prot, child.stdin, "fasta")
child.stdin.close()
align_handle = "ALN" + protein_fasta_handle
align = AlignIO.read(child.stdout, "fasta")
AlignIO.write(align, align_handle, "fasta")
print(align)

This script works, however it only works when I give it a low number of starting sequences. I have successfully gotten it to work with a fasta containing 8 sequences, but when I tried to use 300 starting sequences the script freezes. I initially thought it was being non-responsive because the MUSCLE alignment was taking a long time. To test this I ran MUSCLE from outside of biopython and the alignment only took 20 seconds.

By inserting print statements at various places I learned that the problem arises on line 43:

align = AlignIO.read(child.stdout, "fasta")

I haven't seen any error messages or anything of that nature. Can anyone spot something I'm doing wrong here?

Thank you for your time.

sequence biopython alignment • 2.8k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.3 years ago by julestrachsel ▴ 20

0

Entering edit mode

Try adding parameters to the MuscleCommandline() function; remove the stdin parameter in the Popen() function, after that add the line: stdout, stdout = child.communicate() and remove the line SeqIO.write(prot, child.stdin, "fasta") which I don't really understand what are you doing with it because you are already saving the protein sequences in fasta format.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by rafa.rios.50 ▴ 60

0

Entering edit mode

Working directly with pipes like this (child.stdout etc) is much harder to debug. Try making temporary files instead.

ADD REPLY • link 8.2 years ago by Peter 6.0k