Question

how to get annotations using gene ids in python

0

Entering edit mode

9.1 years ago

vineelagangalapudi • 0

I have a augustus output file, which I further edited and now it looks like this.

gdna.0.1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        gene    3912    10875   0.01    -       .       ID=g1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        transcript      3912    10875   0.01    -       .       ID=g1.t1;Parent=g1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        transcription_end_site  3912    3912    .       -       .       Parent=g1.t1
gdna.0.2

gnl|Hsal_3.3|scaffold48 AUGUSTUS        exon    3912    5899    .       -       .       Parent=g1.t1

gnl|Hsal_3.3|scaffold48 AUGUSTUS        stop_codon      4604    4606    .       -       0       Parent=g1.t1

gnl|Hsal_3.3|scaffold48 AUGUSTUS        CDS     4604    5656    0.89    -       0       ID=g1.t1.cds;Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        start_codon     5654    5656    .       -       0       Parent=g1.t1

gdna.0.3

gnl|Hsal_3.3|scaffold48 AUGUSTUS        exon    9919    10875   .       -       .       Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        transcription_start_site        10875   10875   .       -       .       Parent=g1.t1

gdna.0.4
gnl|Hsal_3.3|scaffold112        AUGUSTUS        gene    9196    31470   0.01    +       .       ID=g2
gnl|Hsal_3.3|scaffold112        AUGUSTUS        transcript      9196    31470   0.01    +       .       ID=g2.t1;Parent=g2
gnl|Hsal_3.3|scaffold112        AUGUSTUS        transcription_start_site        9196    9196    .       +       .  Parent=g2.t1

gdna.0.5

gnl|Hsal_3.3|scaffold112        AUGUSTUS        exon    9196    9505    .       +       .       Parent=g2.t1
gnl|Hsal_3.3|scaffold112        AUGUSTUS        exon    9623    10706   .       +       .       Parent=g2.t1
gnl|Hsal_3.3|scaffold112        AUGUSTUS        exon    10757   11547   .       +       .       Parent=g2.t1
gnl|Hsal_3.3|scaffold112        AUGUSTUS        exon    12892   13859   .       +       .       Parent=g2.t1

I have a list of gene ids like this:

gdna.0.2
gdna 0.4

I want to get the annotations of these gene ids alone into a new file from the augustus output file I specified above, I initially wanted to use a awk to do this, but since the input file is huge and also the number of gene ids for which I needed the annotation entries were also huge, I understood its best to write a script.

The output should look like this

gdna.0.2

gnl|Hsal_3.3|scaffold48 AUGUSTUS        exon    3912    5899    .       -       .       Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        stop_codon      4604    4606    .       -       0       Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        CDS     4604    5656    0.89    -       0       ID=g1.t1.cds;Parent=g1.t1
gnl|Hsal_3.3|scaffold48 AUGUSTUS        start_codon     5654    5656    .       -       0       Parent=g1.t1

gdna.0.4
gnl|Hsal_3.3|scaffold112        AUGUSTUS        gene    9196    31470   0.01    +       .       ID=g2
gnl|Hsal_3.3|scaffold112        AUGUSTUS        transcript      9196    31470   0.01    +       .       ID=g2.t1;Parent=g2
gnl|Hsal_3.3|scaffold112        AUGUSTUS        transcription_start_site        9196    9196    .       +       .  Parent=g2.t1

I tried using dictionaries for this purpose, but have no luck in moving further after this.

from collections import defaultdict #this will make your life simpler
f = open('test.txt','r')
list=defaultdict(str)
name = ''
for line in f:
    if line.startswith('gdna.'):
        name = line[1:-1]
        continue 
    list[name]+=line.strip()
print list

Can anyone suggest if there is a more effective way to do this?

Thanks

genome • 3.6k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.1 years ago by vineelagangalapudi • 0

Ram · Answer 1 · 2015-04-08

This is a great attempt at solving the problem, you're really close to a solution. Just iterate through the dictionary and print it:

for gene_id, lines in list.iteritems():
    print gene_id
    print lines

This would print gene_ids in arbitrary order though (as dictionaries have no order, have a look at OrderedDict in collections package, or simply store the order of gene_ids as you see them, if that is important).

Now, some educational advice for you:

Looking at your code, you should never be calling a python variable "list" as that shadows a builtin type (list). If you really really really want to call something "a list", then use "list_" (yes, with a trailing underscore) as the PEP8 standard suggests (which every one of you python coders should have read, by the way).

Lastly, if you write lines like line[1:-1] add a comment what you were trying to do with this slice of line -- I, personally, have no idea.

Most importantly, as Alex pointed out you could just get away without storing anything in memory by just walking through the file with an on/off switch. He is wrong about using the dictionary for the filtered gene IDs though, you should be using a set data type and use the python 'in' keyword. For completeness I also added the filename as a CONSTANT as this is a good practice. I also used the yield keyword (which is not necessary in this case since printing already works like yield) but useful if in the future you would want to do something similar that does not involve printing look it up.

Now, here's your solution below. Don't be alarmed, it is lengthy because I added lots of comments trying to explain why things are done the way they are. If you remove them it is as succinct as yours!

import sys

# Some constants (note the CAPITAL_CASING for constants -- PEP8)
INPUT_FILENAME = 'test.txt'  
INTERESTING_GENE_IDS = {'gdna.0.2', 'gdna.0.4'} # Note how this is a set

def filter_file_for_gene_ids(file_, ids_to_keep):
    yield_current_line = False  # binary switch as Alex suggested    
    for line in file_:
        # Similar to what you have done
        if line.startswith('gdna.'):
            gene_id = line.strip() # Removes the \r\n and other spaces at the end

            yield_current_line = gene_id in INTERESTING_GENE_IDS     # isn't python beautiful?

        # you could do a print here directly, 
        # but I am not a big fan of 'print' statements not in main() function
        # furthermore, using a generator would allow you to do something else with this line outside the function
        if yield_current_line:
            yield line  # Note that this also yields the lines starting with 'gdna.'!

# Not strictly necessary, but having your main() method as a function makes life easier
# by not polluting the global namespace, should you ever want to import anything form this script
def main():

    # the with .. statement is a context manager -- look it up, makes life easier 
    # also note how I add _ at the end of "file" as not to shadow built-in name
    with open(INPUT_FILENAME) as file_:

        for filtered_line in filter_file_for_gene_ids(file_, INTERESTING_GENE_IDS):
            # Since the filter_file_for_gene_ids returns a generator, print the lines ourselves here
            print filtered_line 

if __name__ == '__main__':
    main()

Ram · Answer 2 · 2015-04-08

Don't keep a dictionary of gene IDs and all associated lines. That's expensive in time and memory. Instead, just keep a simple dictionary of gene IDs that you're interested in filtering for.

I'm presuming you are using Python. But a better solution can be written in any language.

Set a flag variable called print_current_line (for example) to False. You only print lines of your annotation file when this value is True.

When you walk through the annotation file line by line, test if the line contains one column or starts with your prefix. This is a line containing a key value you can test against your genes-of-interest dictionary.

In Python, you could use a try-except block that looks for a KeyError (no key present) or the other case, where the key exists.

If the key exists, set the print_current_line flag to True. Otherwise, set the flag to False.

Outside the dictionary lookup, print the current line to standard output if print_current_line is True. Else, do nothing and loop back up to check the next line, until EOF.

This approach lets you stream through the file and only print conditionally. Streaming will use much less time and memory than first reading the entire annotation file into a dictionary.