Question

Extracting Gene nucleotide sequences from a Genbank files using Biopython

1

Entering edit mode

7.0 years ago

tpaisie ▴ 80

Hi everyone, I'm trying to extract a CDS based on the product value. Basically want to extract a certain gene from multiple genbank files. This is the code I have so far:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature
from Bio.SeqRecord import SeqRecord

gb_file = open("fimA_seqs.gb", "r")

for gb_record in SeqIO.parse(gb_file, "genbank"):
    # now do something with the record
    for feat in gb_record.features:
        if feat.type == "CDS":
            product = feat.qualifiers['product'][0]
            if product == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

So I want to take the nucleotide sequence from any CDS (feature) with that product (qualifier) label and put them all in the same fasta file. Hopefully that makes sense! Any help would be greatly appreciated!!

python gene sequence • 9.9k views

ADD COMMENT • link updated 7.0 years ago by Joe 21k • written 7.0 years ago by tpaisie ▴ 80

score 0 · Answer 1 · 2017-04-26

0

Entering edit mode

7.0 years ago

Joe 21k

See my answer here - specifically the last script to extract CDS features as NA.

A: How do can I use Biopython and SeqIO to parse out multiple genes from several NC

You should be able to modify the code easily with something akin to your line:

 if feature.id == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

Obviously this will only work if the ID matches exactly. I don't know what your input file is like. Something like:

if 'FimA' in feature.id:

might work if the feature name isn't always the same - have a play around. This link might also be useful: http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html

This is a pretty common question on the forum BTW so make sure to search.

ADD COMMENT • link 7.0 years ago by Joe 21k

0

Entering edit mode

So i can't incorporate feature.id in my script and work. Are you saying instead of saying:

if product == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

I should put:

if feature.id == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :

or: if 'FimA' in feature.id:

I also would like my script to use SeqIO.write to make the fasta file.

ADD REPLY • link 7.0 years ago by tpaisie ▴ 80

2

Entering edit mode

My mistake, I misunderstood what you were trying to do. Here's what you'd need to try. I haven't implimented a SeqIO write, but that should be fairly straightforward for you.

# $ python extractNAfeature.py genbank.gbk outfile.fasta

from Bio import SeqIO
import sys

product = 'NADH dehydrogenase subunit 5'

with open(sys.argv[2], 'w') as nfh:
    for rec in SeqIO.parse(sys.argv[1], "genbank"):
            if rec.features:
                    for feature in rec.features:
                            if feature.type == "CDS":
                                if feature.qualifiers['product'][0] == product:
                                    nfh.write(">%s|%s from %s\n%s\n" % (
                                              feature.qualifiers['gene'][0],
                                              feature.qualifiers['product'][0],
                                              rec.name,
                                              feature.location.extract(rec).seq))

Note that that is kind of old syntax for building a string these days, but it works fine for this.

ADD REPLY • link 4.0 years ago by Joe 21k

1

Entering edit mode

Thank you. I was able to adapt this for my own work. The code works great. Cheers.

ADD REPLY • link 6.7 years ago by 7ejoseph ▴ 20

0

Entering edit mode

So its when I used this code, just changed the product with the string i want to find, it gives me no error, but my output file is completely empty. I'm not sure what is going wrong, the code makes sense to me. With no error message i'm still kind of lost.

ADD REPLY • link 7.0 years ago by tpaisie ▴ 80

0

Entering edit mode

It needs to match exactly. You'll need to copy in the relevant section of the genbank and your exact search query for me to be able to help you any further. The code works just fine for me with a dummy genbank and random query

ADD REPLY • link 7.0 years ago by Joe 21k

0

Entering edit mode

yep my "product =" was incorrect. So if i wanted to make have it search for a gene, so it just looks for:

product = "FimA"

instead of the exact phrase, how would I go about that?

ADD REPLY • link 7.0 years ago by tpaisie ▴ 80

1

Entering edit mode

Really easy. Python has nice tricks when it comes to comparing strings. If you simply change:

if feature.qualifiers['product'][0] == product:

to

if product in feature.qualifiers['product'][0]:

You'll get everything that contains that string. Bear in mind this will still be case sensitive, and will give you every gene that has that string in it's product line. Since product lines can be quite variable you might get some genes back which aren't FimA, but maybe interact with it and mention it in the product line, that kind of thing.

In the test files I'm using for example (a mitochondrial genome GBK) and a random gene, in this case, NADH, I get 6 results in my fasta file:

>ND1|NADH dehydrogenase subunit 1 from NC_010689
>ND2|NADH dehydrogenase subunit 2 from NC_010689
>ND3|NADH dehydrogenase subunit 3 from NC_010689
>ND4L|NADH dehydrogenase subunit 4L from NC_010689
>ND4|NADH dehydrogenase subunit 4 from NC_010689
>ND5|NADH dehydrogenase subunit 5 from NC_010689

ADD REPLY • link 7.0 years ago by Joe 21k

1

Entering edit mode

Thank you so much for your help! that worked!!

ADD REPLY • link 7.0 years ago by tpaisie ▴ 80

0

Entering edit mode

If it's all solved, make sure you accept this as an answer :)

ADD REPLY • link 7.0 years ago by Joe 21k