Question

Filtering sequences from a Fasta file using Biopython

2

Entering edit mode

7.3 years ago

tpaisie ▴ 80

Hi everyone, I'm pretty sure this would be a super simple python script using Biopython, but could someone point me in the right direction? I have a Fasta file with 229 sequences, they are named by only their genbank accession number. I have a text file with the genbank accession numbers of the sequences I want to extract from the original fasta file with the 229 sequences. I would like to be able to filter out the sequences listed in my text file to create a new fasta file, with only my sequences of interest.

I feel like I accidentally did this once when I was trying to write a script for another purpose. I also am trying to actively learn biopython, so just a push in the right direction would help a lot!! Thanks!

Biopython Python Phylogenetics Fasta • 8.3k views

ADD COMMENT • link 7.3 years ago by tpaisie ▴ 80

WouterDeCoster · Answer 1 · 2017-01-05

1

Entering edit mode

7.3 years ago

WouterDeCoster 47k

I assume you went through the Biopython Tutorial and Cookbook?

This is indeed pretty straightforward. Read the text file with accession numbers in a list (desiredList), iterate over your fasta file and if the record.id attribute is in the desiredList, add it to an output list. Then after looping, write the desired sequences to an output file. Need more pointers?

ADD COMMENT • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

so biopython has this as their filtering example:

from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

so when i modify this example to my files it gives me only one sequence in my output fasta file, I guess i'm confused why its doing that.

ADD REPLY • link updated 7.3 years ago by WouterDeCoster 47k • written 7.3 years ago by tpaisie ▴ 80

0

Entering edit mode

I assume you adapted the code to work for fasta format? Could you show your code as well? This seems right...

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

from Bio import SeqIO
input_file = "cov_uniq_aln_103016.fasta"
id_file = "CoV_seq_names_010517.txt"
output_file = "cov_less_seqs_010517.fasta"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print("Found %i unique identifiers in %s" % (len(wanted), id_file))
records = (r for r in SeqIO.parse(input_file, "fasta") if r.id in wanted)
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted)-count, input_file))

Maybe i'm missing something in the code for the fasta format? it does output a fasta file, just only with one sequence instead of the 80 or so sequence ids in the text file.

ADD REPLY • link updated 7.3 years ago by WouterDeCoster 47k • written 7.3 years ago by tpaisie ▴ 80

0

Entering edit mode

And the wanted variable looks okay? It contains the expected number of identifiers (and the expected values)?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

This is my output from the terminal:

Found 1 unique identifiers in CoV_seq_names_010517.txt Saved 1 records from cov_uniq_aln_103016.fasta to cov_less_seqs_010517.fasta

And the output fasta file has only the first sequences on my id_file, so I think the problem is this line:

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))

ADD REPLY • link 7.3 years ago by tpaisie ▴ 80

0

Entering edit mode

Indeed, only one identifier was found. You will need to modify the line you mentioned. How does your id_file look like? How is it separated?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 1 · Answer 2 · 2017-01-05

1

Entering edit mode

7.3 years ago

mforde84 ★ 1.4k

You're making this way too hard on yourself. Bash is pretty awesome if you give it a try. :)

$ head lookup.txt;
GI123
GI987
GI3873
$ while read -r line; do
     sed -n -e "/>$line/,/>/ p" sequence.fasta | head -n-1 >> newfasta.txt;
  done < "lookup.txt";

ADD COMMENT • link 7.3 years ago by mforde84 ★ 1.4k

1

Entering edit mode

Definitely agree with the usefulness of bash for simple tasks, but in the spirit of learning biopython for more complex tasks I understand OP tries Biopython for this job as well.

(And I think I don't need a lot more lines in Python to do the same job ;-)
Constructs like "/>$line/,/>/ p" scare me. )

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

score 1 · Answer 3 · 2017-01-05

Wrote a multipurpose fasta-fetcher a while back here: https://github.com/jrjhealey/bioinfo-tools/blob/master/fastafetcher.py

It supports finding the right fasta by a file of keys, by string directly, and also has an 'invert' parameter to retrieve everything but the ones you list. Usage I think is pretty self explanatory (script has a help function thanks to argparse) but feel free to ask for clarification:

python fastafetcher.py -k keyfile.txt -f fastafile.fasta [--invert]

key file has to be a simple list of exact strings, one per line.

# Extract fasta files by their descriptors stored in a separate file.
# Requires biopython

import StringIO
from Bio import SeqIO
import sys
import traceback
import warnings
import argparse

def getKeys(keyFile):
    """Turns the input key file into a tuple. May be memory intensive."""

    with open(keyFile, "r") as kfh:
        keys = []
            for line in kfh:
            line = line.rstrip('\n')
            line = line.lstrip('>')
            keys.append(line)

    return keys

def main():
    """Takes a list of strings in a text file (one per line) and retreives them and their sequences from a provided multifasta."""
    # Parse arguments from the commandline:
    try:
        parser = argparse.ArgumentParser(description='Retrieve one or more fastas from a given multifasta.')
        parser.add_argument(
            '-f',
            '--fasta',
            action='store',
            required=True,
            help='The multifasta to search.')
        parser.add_argument(
            '-k',
            '--keys',
            action='store',
            required=True,
            help='A file of header strings to search the multifasta for. Must be exact. Must be one per line.')
        parser.add_argument(
            '-o',
            '--outfile',
            action='store',
            default=None,
            help='Output file to store the new fasta sequences in. Just prints to screen by default.')
        parser.add_argument(
            '-v',
            '--verbose',
            action='store_true',
            help='Set whether to print the key list out before the fasta sequences. Useful for debugging.')
        parser.add_argument(
            '-i',
            '--invert',
            action='store_true',
            help='Invert the search, and retrieve all sequences NOT specified in the keyfile.')
        args = parser.parse_args()

    except:
        print "An exception occured with argument parsing. Check your provided options."
        traceback.print_exc()

    # Rename args to enable them to be provided to the getKeys function:
    keyFile = args.keys
    inFile = args.fasta
    outFile = args.outfile
# Main code:
    # Call getKeys() to create the tuple of keys from the provided file:
    try:
        keys = getKeys(keyFile)
    except IOError:
        keys = args.keys


    if args.verbose is not False:
        if args.invert is False:
            print('Fetching the following keys from: ' + inFile)
            for key in keys:
                print(key)
        else:
            print('Ignoring the following keys, and retreiving everything else from: ' + inFile)
            for key in keys:
                print(key)

    # Parse in the multifasta and assign an iterable variable:
        seqIter = SeqIO.parse(inFile, 'fasta')

    # For each sequence in the multifasta, check if it's in the keys[] tuple. If so, print it out:
    for seq in seqIter:
        if args.invert is False:
            if seq.id in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outFile, "fasta")
        else:
    # If the --invert/-i flag is given, print all fastas NOT listed in the keyfile
            if seq.id not in keys:
                print(seq.format("fasta"))
            if args.outfile is not None:
                SeqIO.write(seq, outFile, "fasta")

if __name__ == "__main__":
    main()

score 0 · Answer 4 · 2017-01-05

0

Entering edit mode

7.3 years ago

tpaisie ▴ 80

So I figured it out! The original script works, I saved my id_file has a tab delimited file in excel and that was the problem! You guys are really motivating me to work on my python!

ADD COMMENT • link 7.3 years ago by tpaisie ▴ 80

0

Entering edit mode

Good to hear! You were pretty close to the solution yourself, too.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Excel strikes again!

ADD REPLY • link 7.3 years ago by Joe 21k