Biostar Beta. Not for public use.
Python Regular Expressions to extract the transcript ID from lines of GTF file
3.4 years ago
oars • 150
@oars41179

I'm sure there is a smart and simple way to do this, but I'm stuck. I simply want to extract the transcript_id field using the re module (re.findall) from the lines of a GTF file (Homo sapiens). This is what I have so far:

import re
f = open ('Homo_sapiens.GRCh38.89.gtf', 'r')
# Feed the files into findall(); it returns a list of all the found strings
string = re.findall(r'transcript_id, f.read())
     print transcript_id

Where did I go wrong?

python transcript_id GTF • 1.8k views
3.4 years ago
@Devon Ryan7403

It turns out to be really hard (and error-prone) to come up with a good regular expression to parse that out. In deepTools, I use the following process:

  1. Split each line into columns by tab (cols = line.strip().split("\t")).
  2. Use the csv module to parse the last column (s = next(csv.reader([cols[8]], delimiter=' '))).
  3. Get the column after transcript_id (s[s.index('transcript_id') + 1].rstrip(";"))
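The three steps above can be put together in a small function. This is only a minimal sketch of the steps as listed, not the actual deepTools code; the function name `transcript_id_from_gtf_line` and the example lines are made up for illustration.

```python
import csv


def transcript_id_from_gtf_line(line):
    """Return the transcript_id of one GTF line, or None if it has none.

    Hypothetical helper that chains the three steps listed above.
    """
    # Step 1: split the line into its tab-separated columns.
    cols = line.strip().split("\t")
    if len(cols) < 9:
        return None  # not a complete GTF record

    # Step 2: let the csv module split the attribute column on spaces.
    # Quoted values stay in one piece even if they contain spaces or ';'.
    s = next(csv.reader([cols[8]], delimiter=' '))

    # Step 3: the value is the item right after the 'transcript_id' key.
    try:
        return s[s.index('transcript_id') + 1].rstrip(";")
    except (ValueError, IndexError):
        return None  # no transcript_id on this line


example = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
           'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
           'gene_name "DDX11L1";')
print(transcript_id_from_gtf_line(example))  # ENST00000456328
```

Because the csv reader treats the double quotes as field quoting, an ID such as "TX;1 a" would come back whole, which is the main reason to prefer this over a hand-written pattern.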

You can see an example of that here, which is the python module I wrote for deepTools to read BED/GTF files into a custom interval tree.


Actually, I do like this approach; thank you for sharing, Devon. Your solution is a bit slower than a regex, but definitely safer. Thanks!

3.4 years ago
oars • 150
@oars41179

Many thanks, Devon! Is it even possible to parse out transcript_ids with a regular expression (re.match, re.search, or re.findall)?


Please note the "ADD COMMENT" button.

To your question, perhaps? I never got one to work with all of my weird test cases, but I suspect that's simply because I didn't try hard enough. The method I posted works with every weird case I could come up with, so I just left it at that.

3.4 years ago
oars • 150
@oars41179

I found this code on Stack Exchange:

if re.findall(r'transcript_id=[^\s]+',line):
    transcript = re.findall(r'transcript_id=[^\s]+',line)[0]
else:
    transcript = "NA"

I don't understand the [^\s]+',line): portion of the code - what does this section do?

Would this work?

f = open('Homo_sapiens.GRCh38.89.gtf', 'r')
if re.findall(r'transcript_id=[^\s]+',line):
    transcript = re.findall(r'transcript_id=[^\s]+',line)[0]
else:
    transcript = "NA"

That won't work. [^\s]+ matches one or more non-whitespace characters (it is equivalent to \S+), so transcript_id=[^\s]+ finds transcript_id= followed by a run of non-whitespace. That fails because (1) transcript IDs can contain whitespace and (2) there are no transcript_id= instances in a GTF file. That's the annoying part about transcript IDs (and gene names and such): they can contain anything, including quotes and delimiters.
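To see this concretely, here is a small check of the pattern against a GTF-style attribute string and, for contrast, a key=value string in the GFF3 style. Both strings are made-up examples, not lines taken from the file in the question.

```python
import re

# The pattern from the snippet in question: the literal text
# 'transcript_id=' followed by one or more non-whitespace characters.
pattern = re.compile(r'transcript_id=[^\s]+')

# GTF-style attributes: key, space, quoted value, semicolon (no '=').
gtf_attributes = 'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'

# GFF3-style attributes: key=value pairs (illustrative only).
gff3_attributes = 'ID=transcript:ENST00000456328;transcript_id=ENST00000456328'

gtf_hits = pattern.findall(gtf_attributes)
gff3_hits = pattern.findall(gff3_attributes)

print(gtf_hits)   # []
print(gff3_hits)  # ['transcript_id=ENST00000456328']
```

The empty list for the GTF string is why the snippet would always fall through to "NA" on a GTF file.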

3.4 years ago
glihm • 600
@glihm19968

First, @Devon Ryan's solution is worth considering, as he has already spent time writing and checking that code. He uses the csv module to make sure the split is done correctly (taking into account whether values are double-quoted, etc.).

If you still want to go with a Python regex, it is certainly possible.

1) Review the format you want to parse: GTF (link to format description).
2) You want to extract a piece of information from the last column (the attribute column), which can contain several tags separated by semicolons (';').
3) Now that you know what you want, you can build the regex from pseudocode like "{transcript_id}{\space}{id_value};", which gives the following Python regex:

"transcript_id\s([^;]+);?"

where:

transcript_id => the tag you want to find.

\s => matches a whitespace character, since GTF attributes look like this: TAG{SPACE}VALUE;

([^;]+) => captures ALL characters EXCEPT the semicolon, since the semicolon is the attribute separator. The parentheses tell Python "I want to extract this part". In the version you pasted, [^\s] is used instead, which means whitespace is not allowed in the ID.

;? => in a regex, '?' makes the preceding character optional. Sometimes the last attribute on a GTF line lacks the terminating semicolon (even though it is mandatory), so with ";?" the pattern matches whether or not the ';' is present.

With this in hand, you can write a Python function to extract the transcript_id from a GFF/GTF line:

#!/usr/bin/env python3

# Stdlib Python3
import re

# Constants
GFF_SEP = "\t"

# Create a function to extract the transcript_id from a gff line.
# We assume that there is only ONE transcript_id per line.
def get_transcript_id(gffline):
    """ Returns the transcript_id (str) in the gffline.
    None if transcript_id not found.
    """
    # We first extract the attribute field (the last one) as you know that your tag is here
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    # Raw string, so that \s reaches the regex engine unchanged.
    regex_pattern = re.compile(r"transcript_id\s([^;]+);?")
    regex_results = regex_pattern.search(attribute_field)

    try:
        # Note: the captured value still includes the surrounding double quotes.
        return regex_results.group(1)
    except AttributeError:
        # search() returned None: no transcript_id tag on this line.
        return None


def main():
    """ Runs get_transcript_id on one example line and prints the result.
    """
    one_gff_line = """1\tprocessed_transcript\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana";"""

    tr_id = get_transcript_id(one_gff_line)
    if tr_id:
        print("transcript_id extracted = %s" % (tr_id))
    else:
        print("transcript_id tag not found.")

if __name__ == "__main__":
    main()

And as you can see, this function can easily be extended to match any tag you want by adding another parameter, "tag_name", to the function signature. ;)
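A possible shape for that extension is sketched below. The name `get_attribute`, the default value of `tag_name`, the word-boundary anchor and the stripping of the surrounding quotes are all choices made for this illustration, not part of the function above.

```python
import re

GFF_SEP = "\t"


def get_attribute(gffline, tag_name="transcript_id"):
    """Return the value of tag_name in the attribute column, or None.

    Hypothetical generalisation: same regex idea, with the tag as a parameter.
    """
    attribute_field = gffline.strip().split(GFF_SEP)[-1]
    # re.escape keeps any special character in the tag literal; \b stops
    # a short tag such as 'id' from matching inside 'gene_id'.
    pattern = re.compile(r"\b" + re.escape(tag_name) + r"\s([^;]+);?")
    match = pattern.search(attribute_field)
    if match is None:
        return None
    # Drop the double quotes that GTF puts around values.
    return match.group(1).strip('"')


line = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
        'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
        'gene_name "DDX11L1";')
print(get_attribute(line))               # ENST00000456328
print(get_attribute(line, "gene_name"))  # DDX11L1
```

Like the original, this still splits on ';', so it inherits the same limitation for values that contain a semicolon.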

I hope this helps and feel free to ask if something is not clear.


I hope there are no transcript IDs (or anything else that you want to extract) that contain a semi-colon...
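The concern is easy to reproduce. The sketch below runs the regex and the csv-based split on the same made-up attribute string, in which the transcript ID deliberately contains a semicolon.

```python
import csv
import re

# Made-up attribute column whose transcript ID contains a ';'.
attributes = 'gene_id "G1"; transcript_id "TX;1"; gene_name "N";'

# Regex approach: [^;]+ stops at the first semicolon, inside the ID.
regex_value = re.search(r'transcript_id\s([^;]+);?', attributes).group(1)

# csv approach: the quoted value is kept as a single field.
fields = next(csv.reader([attributes], delimiter=' '))
csv_value = fields[fields.index('transcript_id') + 1].rstrip(';')

print(regex_value)  # "TX   (truncated, with the opening quote left on)
print(csv_value)    # TX;1
```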


That's a good point. I usually don't have any IDs with ';' inside them, but you're right: I was reviewing your code, and it is more general because you don't have to care about the content of the ID. :)

3.4 years ago
Macspider ♦ 2.8k
@Macspider33885

One comment only (I cannot add anything to what was already discussed here above):

GTF format files SHOULD have, in field 9, first "gene_id" and then "transcript_id". If the file format is respected, which you should assume in the first place, a simple one-liner is enough:

line.rstrip("\b\r\n").split("\t")[8].split("; ")[1].split(" ")[1].strip("\"")

This extracts the transcript_id from each line, because it should always be the second item of the 9th field, after gene_id.

Of course, if you don't trust the format to be respected, using re.search might be a good choice.
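A slightly more defensive variant of the positional approach checks that the second attribute really is transcript_id before returning it. This is only a sketch: the function name `positional_transcript_id` and both example lines are made up, and values containing "; " would still break the split.

```python
def positional_transcript_id(line):
    """Return the second attribute's value if its key is transcript_id.

    Hypothetical helper; returns None when the layout is not as expected.
    """
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < 9:
        return None  # not a complete GTF record
    attributes = cols[8].split("; ")
    if len(attributes) < 2:
        return None
    # Split 'key "value"' at the first space only.
    key, _, value = attributes[1].partition(" ")
    if key != "transcript_id":
        return None  # second attribute is something else
    return value.rstrip(";").strip('"')


transcript_line = ('1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
                   'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";')
gene_line = ('1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
             'gene_id "ENSG00000223972"; gene_name "DDX11L1";')

print(positional_transcript_id(transcript_line))  # ENST00000456328
print(positional_transcript_id(gene_line))        # None
```

Returning None instead of whatever happens to sit in second position makes it obvious when a line does not follow the expected order.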


Note that the = is particular to GFF and isn't present in GTF.


Good catch. Edited.
