Question

GenBank features don't extract sequences based on condition Biopython

1

Entering edit mode

5.3 years ago

Shred ★ 1.4k

Guys, I've wrote a script to extract sequences between the 23S rRNA and 16S rRNA gene in Python using Biopython. I can't figure out why, even if I've set explicit conditions, it keeps extracting randome sequences into the genome. Here's the code:

for f in glob.glob(cwd+'/genomes/*.gbk'):
    gbank = SeqIO.parse(open(f,'r'), "genbank")
    with open(output_file, "w") as nfh:
        for genome in gbank:
            for gene in genome.features:
                if gene.type=='rRNA':
                    if gene.location.strand == -1:   # wanted just the reverse strand
                        if 'product' in gene.qualifiers:
                            if '16S' in gene.qualifiers['product'][0]:
                                end = gene.location.nofuzzy_start    
                            elif '23S' in gene.qualifiers['product'][0]:
                                start = gene.location.nofuzzy_end
                            if abs(start - end) < 50000:
                                intergenic_sequence.append(">%s|ITS_full_sequence|%s\n%s" % (genome.description, gene.location.strand, genome.seq[start:end]))

I've set gene.type to rRNA and it extract a sequence into a random CDS, nothing to do with ribosomal RNA. This problem happens with several genomes tested (RefSeq records, well annotated).

biopython genbank annotation • 2.5k views

ADD COMMENT • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

You could give my code a try: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py

I've not tested it with rRNA features (or much at all frankly).

It might be worth reading your files in an interactive python session. It's possible the 'key' rRNA isn't what's exactly in the genbank file (though I'm clutching at straws there a bit).

At a glance, I can't see any obvious issue with your code. Perhaps take a step back and remove the if '16S'/ if '23S' conditionals, and just ensure you're getting results you'd expect to get at each stage, before adding in another layer of complexity/filtering.

Side note, you dont need to use open() with SeqIO.parse - it handles all that internally.

ADD REPLY • link 5.3 years ago by Joe 21k

0

Entering edit mode

The key rRNA seems to work if I want to extract a 16S rRNA gene. Every genome tested was open previously with artemis where the search by key gives the right output every time. By far, I don't get why other condition values true even if they're explicitly wrong. I'm looking for gene with 16S inside product qualifiers, how could be even possible that a pilus associated CDS was extracted? May this be a bug or something else?

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

Can you provide a link to one of your input genbanks and I'll have a little play with the data myself.

ADD REPLY • link 5.3 years ago by Joe 21k

0

Entering edit mode

Sure, here's one. Bifidobacterium bifidum PRL2010 (https://www.ncbi.nlm.nih.gov/assembly/GCA_000165905.1)

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

Another small comment, I assume its elsewhere in your script, but you haven't declared the vector intergenic_sequence in that code snippet, so as posted, it would fail based on that alone.

ADD REPLY • link 5.3 years ago by Joe 21k

1

Entering edit mode

intergenic sequence is a list declared before, not copied just to be short.

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

2

Entering edit mode

5.3 years ago

Joe 21k

I'm not 100% sure if this code is right, but hopefully it gets you closer to the desired outcome. It's too late in the day for me to debug your code so it seemed easier to sort of start over.

import sys
from Bio import SeqIO

entry = SeqIO.read(sys.argv[1], 'genbank') # This will need to be changed to `.parse()` and looped over if there are multiple entries in a single .gbk
rnas = [feat for feat in entry.features if feat.type == 'rRNA']

s = [r for r in rnas if '16S' in r.qualifiers['product'][0] or '23S' in r.qualifiers['product'][0]]

start = min([y.location.nofuzzy_start for y in s])
end = max([x.location.nofuzzy_end for x in s])

desired = set(xrange(int(start),int(end),1))
for feat in entry.features:
    span = set(xrange(feat.location._start.position, feat.location._end.position))
    if span & desired:
        print(feat)

# If you want all sequence including intergenic spans, replace the block above with
# entry[start:end]
# this will create a 'slice' of the genome between the 2 most extreme rRNA coordinates

If I've interpreted what you want correctly, this should find all features that occur between the farthest separated 16S/23S RNAs. Its probably not the fastest nor most elegant code since it iterates all the features twice but I think it suffices.

I haven't sanity checked the output to see if its correct though, so you'll have to see whether it's giving you junk back or not. You'll also have to tweak the code a little if you want output in a certain format etc.

ADD COMMENT • link 5.3 years ago by Joe 21k

1

Entering edit mode

Ouh, you're damn close to the solution. I've tested your code, by changing xrange with range because of Python 3.x. I've changed the last print statement with print(entry[start:end]) as suggested, and by printing start and end coordinates to investigate on it in Artemis. The only problem is that, as you could probably know, in bacteria there's more than a single rRNA locus. So, as written, your code identifies the 1st 23S gene and the last 16S gene. How could I iterate your solution in order to have start/end point in every rRNA locus?

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

Each of the features corresponding to a 16S or 23S rRNA is contained within the list s. Do you mean you want the coordinates of each 16S/23S? I'm still not totally understanding exactly what the desired output is. If you want to see all the coordinates for the rRNAs though, just do:

import sys
from Bio import SeqIO

entry = SeqIO.read(sys.argv[1], 'genbank') # This will need to be changed to `.parse()` and looped over if there are multiple entries in a single .gbk
rnas = [feat for feat in entry.features if feat.type == 'rRNA']

s = [r for r in rnas if '16S' in r.qualifiers['product'][0] or '23S' in r.qualifiers['product'][0]]

print([i for i in s]) # (Since you're on python3 this syntax will work)

ADD REPLY • link 5.3 years ago by Joe 21k

0

Entering edit mode

I'll try to be clearer using your last code. The print statement as written above gives me this result:

   [SeqFeature(FeatureLocation(ExactPosition(1580100), ExactPosition(1583137), strand=-1), type='rRNA'), 
SeqFeature(FeatureLocation(ExactPosition(1583588), ExactPosition(1585127), strand=-1), type='rRNA'),
 SeqFeature(FeatureLocation(ExactPosition(2154392), ExactPosition(2157429), strand=-1), type='rRNA'),
 SeqFeature(FeatureLocation(ExactPosition(2157880), ExactPosition(2159419), strand=-1), type='rRNA'),
 SeqFeature(FeatureLocation(ExactPosition(2161512), ExactPosition(2164549), strand=-1), type='rRNA'),
 SeqFeature(FeatureLocation(ExactPosition(2165000), ExactPosition(2166538), strand=-1), type='rRNA')]

These are the coordinates of every 23S and 16S rRNA gene. As can be seen, there's more than a single locus for rRNA cluster. All I want to do is to extract the intergenic sequence between the end of the 23S rRNA and the 16S rRNA. Using coordinates of the first match reported, between the coordinates 1583137and 1583588. This needs to be iterate through every rRNA match: so also between 2157429 and 2157880, and between 2164549 and 2165000.

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

Of those 6 features, which pairs of 23S and 16S do you want to get sequence between? The closest pairs?

ADD REPLY • link 5.3 years ago by Joe 21k

0

Entering edit mode

Yeah. Just to be clear, here's a screenshot from Artemis, where you can see the typical distribution of these gene into every bacterial genome. (yellow one is the 23S and green one is the 16S). So yes, closest pair is what exactly I need to extract. [genome]

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

1

Entering edit mode

Ok great, the image helps. It shouldnt be too hard to do this, though I haven't got time right now. I see 2 approaches off the top of my head though, if you want to code something in the mean time:

Use the locus tags to identify adjacent pairs (their numbers should be different by 5 if your screen shot is anything to go by).
Iterate all pairwise combinations (itertools.combinations), and take the top 3 pairs with the minimum disparity in coordinate magnitude (which looks to be sort of the approach you were applying in your opening post?).

Once you have the coordinates of adjacent pairs, you just need to slice out the difference between the end of the lesser value, and the start of the higher value (the 3' and 5' ends of the genes).

Biopython stores all its coordinates relative to the +1 strand, even for features on the reverse strand, so this should be as simple as extracting the sequence between the closest 2 coordinates of the 4 coordinates for any pair of rRNAs. e.g.:

If a pair of rRNAs had the coordinates: ([1, 1000], [1500, 2500]) you just need the interval 1000-1500.

Hopefully my logic is clear enough (and correct!).

ADD REPLY • link 5.3 years ago by Joe 21k

1

Entering edit mode

Found a way, thanks for your precious help. I'll post the code as soon as it tested on real-life application.

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

Great, feel free to accept one or more answers to provide thread closure if your problem has been resolved.

ADD REPLY • link 5.3 years ago by Joe 21k

0

Entering edit mode

5.3 years ago

vkkodali_ncbi ★ 3.7k

You can use the feature_table.txt file in combination with efetch tool from Entrez Direct as shown below.

A feature table file for every annotated assembly is present in the FTP location from where you may have downloaded the genome genbank flat files from. For example, the feature table file for the assembly GCA_000165905.1 is located here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/165/905/GCA_000165905.1_ASM16590v1/GCA_000165905.1_ASM16590v1_feature_table.txt.gz

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/165/905/GCA_000165905.1_ASM16590v1/GCA_000165905.1_ASM16590v1_feature_table.txt.gz
zgrep '[0-9]*S ribosomal RNA' GCA_000165905.1_ASM16590v1_feature_table.txt.gz \
    | awk 'BEGIN{FS="\t";OFS="\t";spos=0;epos=0;chr=0}{
        if($14~/^23S/)
            {chr=$7;spos=$9+1}
        else if($14~/^16S/)
            {epos=$8-1; print chr,spos,epos} }' \
    | while read -r chr spos epos ; 
        do 
            efetch -db nuccore -id $chr -seq_start $spos -seq_stop $epos -format fasta ; 
        done

ADD COMMENT • link 5.3 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

Not a good solution, because I'm using multiple genomes, as can be seen at line 1.

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

I do realize that you will be doing this on multiple genomes. If anything, this will save you some disk space too because instead of downloading entire genomes in genbank flat file format, you will have to download the much smaller feature_table.txt.gz files.

If you have a list of assembly accessions of interest, you can download corresponding feature_table.txt.gz files for all of them easily as well. You can use a variation of code from this document; specifically, change the genomic.fna.gz portion to feature_table.txt.gz in the code at the end of page 3.

Once you have all of the feature table files, you would just have to loop over all of them; run the awk code to generate a three column table with chr, spos and epos; and then use efetch to download all sequences in fasta format.

ADD REPLY • link 5.3 years ago by vkkodali_ncbi ★ 3.7k

0

Entering edit mode

I'm downloading genomes using keywords from assembly. Thanks for that, I'll have a look to the linked document, but I guess is still a bad answer. I'm using a tool like Biopython, intended to be used to do tasks with designed object and class, and apparently there's no reason why it does the work badly. I'm using Python because of cross-platform solution, and the provided solution could be done just with *NIX distribution.

ADD REPLY • link 5.3 years ago by Shred ★ 1.4k

0

Entering edit mode

Fair enough. Just for the sake of others who may end up here at a later point in time, one can use Biopython to query Entrez databases, get the FTP path, download the feature table file, parse it using Python and then use Biopython Entrez to download the FASTA sequences (see http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116).

ADD REPLY • link 5.3 years ago by vkkodali_ncbi ★ 3.7k

score 1 · Accepted Answer · 2019-01-17

Many many thanks for this solution goes to @jrj.healey . Apparently the logic of features selection is the same, so I suddenly don't realize why this selection works instead of the previous.

for f in glob.glob(cwd+'/genomes/*.gbk'):
    gbank = SeqIO.parse(open(f,'r'), "genbank")
    with open(output_file, "w") as nfh:
        for genome in gbank:
            # trait for ITS located onto reverse strand
            rnas_rev = [feat for feat  in genome.features if feat.type == 'rRNA' and feat.location.strand == -1]
            s_rev = [ r for r in rnas_rev if '16S' in r.qualifiers['product'][0]]
            lo_rev = [ r for r in rnas_rev if '23S' in r.qualifiers['product'][0]]
            # trait for ITS located onto forward strand
            rnas_fw = [ feat for feat  in genome.features if feat.type == 'rRNA' and feat.location.strand == -1]
            s_fw = [ r for r in rnas_fw if '16S' in r.qualifiers['product'][0]]
            lo_fw = [ r for r in rnas_fw if '23S' in r.qualifiers['product'][0]]
                # i select 16S genes
                # j select 23S
            for i,j in zip(s_rev,lo_rev):
                ....
            for i,j in zip(s_fw,lo_fw):
                ....