How Can We Find The Info For 3'Utr And 5'Utr In Gencode Gtf File?
3
4
Entering edit mode
10.7 years ago
J.F.Jiang ▴ 910

Hi all,

I want to extract the genomic region of 3'UTR and 5'UTR from the GENCODE gtf annotation file to make a bed file for my analysis.

However, it seems that the gtf file did not seperate the 3'UTR and 5'UTR, rather they report UTR instead.

Although the UCSC table browser offers us these info, it is from v14, not the latest. And I really want to know how can they obtain such information?

If anyone knows the process pipeline, could you kindly tell me the flowchart?

Thanks.

utr • 13k views
ADD COMMENT
1
Entering edit mode

Could you please provide a link that someone can download such a GTF file to experiment with? You'll get a response much faster.

ADD REPLY
9
Entering edit mode
10.6 years ago
dario.garvan ▴ 530

It's not that easy. The GTF only has a feature category called UTR. The end user has to figure out if it is a 5' or 3' UTR. Here is a reproducible example in R of categorising them. Note that some protein-coding transcripts have a 5' UTR and no 3' UTR, a 3' UTR and no 5' UTR, or have neither UTR. It is important to remember the GENCODE annotation is a collection of transcript *fragments*, not full-length models.

library(GenomicRanges) # From Bioconductor.

genes <- read.table("gencode.v17.annotation.gtf", sep = '\t', skip = 5, stringsAsFactors = FALSE)
whichCodingTranscripts <- genes[, 3] == "transcript" & grepl("transcript_type protein_coding", genes[, 9], fixed = TRUE)
proteinTranscripts <- genes[whichCodingTranscripts, ]
strands <- proteinTranscripts[, 7]
allFeaturesTranscripts <- gsub("transcript_id ", '', sapply(strsplit(genes[, 9], "; "), '[', 2))
proteinTranscriptsNames <- allFeaturesTranscripts[whichCodingTranscripts]
whichCDS <- genes[, 3] == "CDS" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsCDS <- genes[whichCDS, ]
transcriptsCDS <- split(GRanges(transcriptsCDS[, 1], IRanges(transcriptsCDS[, 4], transcriptsCDS[, 5]), transcriptsCDS[, 7]),
                    factor(allFeaturesTranscripts[whichCDS], levels = proteinTranscriptsNames))
firstCDS <- mapply(function(CDS, strand) {if(strand == '+') {CDS[1]} else {CDS[length(CDS)]}}, transcriptsCDS, strands)
lastCDS <-  mapply(function(CDS, strand) {if(strand == '+') {CDS[length(CDS)]} else {CDS[1]}}, transcriptsCDS, strands)
whichUTR <- genes[, 3] == "UTR" & allFeaturesTranscripts %in% proteinTranscriptsNames
transcriptsUTR <- genes[whichUTR, ]
transcriptsUTR <- split(GRanges(transcriptsUTR[, 1], IRanges(transcriptsUTR[, 4], transcriptsUTR[, 5]), transcriptsUTR[, 7]),
                    factor(allFeaturesTranscripts[whichUTR], levels = names(firstCDS)))

transcriptsUTR5 <- mapply(function(UTR, CDS, strand)
               {        
                 if(strand == '+') UTR[UTR < CDS[1]] else UTR[UTR > CDS[length(CDS)]]
               }, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)

transcriptsUTR3 <- mapply(function(UTR, CDS, strand)
               {        
                 if(strand == '+') UTR[UTR > CDS[length(CDS)]] else UTR[UTR < CDS[1]]
               }, transcriptsUTR, firstCDS, as.list(strands), SIMPLIFY = FALSE)
ADD COMMENT
0
Entering edit mode

Thanks, I have not updated my post, I finally used perl to extract the UTR information, and made a similar formula to get 3 and 5.

But very thanks for the R script, will compare the result with my result

ADD REPLY
0
Entering edit mode

And now we can also get this data from UCSC table browser, since it already has these info for V17

ADD REPLY
0
Entering edit mode

Not all organisms are annotated with the UTRs. Some has only xenoRefSeq tracks whereby they arbitrarily annotated the UTRs as 200bp upstream/downstream of the start/end codon.

ADD REPLY
3
Entering edit mode
10.6 years ago
Michael 54k

That shouldn't be too difficult to figure, given you have UTRs annotated.

  • 5'-UTRs are those UTRs at the 5' end of a transcript
  • 3'-UTRS are those at the 3' end of a transcript

:)

ADD COMMENT
0
Entering edit mode

I guess what OP has to do:

"transcript 3' coordinate" - "CDS 3' coordinate" = 3' UTR
ADD REPLY
1
Entering edit mode
2.2 years ago

You can use gencode_utr_fix to update your gtf. It replaces UTR features with five_prime_utr and three_prime_utr features. For example:

gencode_utr_fix --input_gtf gencode.v29.annotation.gtf --output_gtf gencode.v29.annotation_utr.gtf

ADD COMMENT

Login before adding your answer.

Traffic: 1825 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6