Different results from biomaRt query and search in GTF
1
0
Entering edit mode
5.1 years ago
lizaveta • 0

I need all the exons coordinates for a specific gene transcript (let's say, "ENST00000269305"). I try to do it with 2 different methods:

  1. using getBM from biomaRt

    library("biomaRt")
    human = useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",host="grch37.ensembl.org", path="/biomart/martservice",ensemblRedirect = FALSE)
    transcript_info = data.frame(unique(getBM(attributes = c("chromosome_name", "genomic_coding_start","genomic_coding_end"), 
                               filters="ensembl_transcript_id", values="ENST00000269305",mart = human)))
    transcript_info[order(transcript_info$genomic_coding_start),]

    chromosome_name genomic_coding_start genomic_coding_end 11 17 7572927 7573008 10 17 7573927 7574033 7 17 7576853 7576926 6 17 7577019 7577155 5 17 7577499 7577608 4 17 7578177 7578289 3 17 7578371 7578554 2 17 7579312 7579590 1 17 7579700 7579721 9 17 7579839 7579912 8 17 NA NA

  2. my function, that uses GTF file (downloaded from Ensembl, filtered to have only protein-coding genes)

    library(rtracklayer)
    library(GenomicRanges)
    gtf = import.gff("~/Documents/data/grch37/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.ProteinCodingGenes_filtered.gtf")
    gtf_exons <- gtf[gtf$type == "exon"]
    transcript_info = gtf_exons[gtf_exons$transcript_id == "ENST00000269305"]
    transcript_info = data.frame(transcript_info[,c()])
    transcript_info = transcript_info[,c(1:3)]
    colnames(transcript_info) = c("chromosome_name", "genomic_coding_start","genomic_coding_end")
    transcript_info[order(transcript_info$genomic_coding_start),]

    chromosome_name genomic_coding_start genomic_coding_end 11 17 7571720 7573008 10 17 7573927 7574033 9 17 7576853 7576926 8 17 7577019 7577155 7 17 7577499 7577608 6 17 7578177 7578289 5 17 7578371 7578554 4 17 7579312 7579590 3 17 7579700 7579721 2 17 7579839 7579940 1 17 7590695 7590856

Surprisingly, it gives me different results although some exons are shared. Manual checking with UCSC Genome Browser (hg19) supports the second script. What can be the reason that biomaRt fails to obtain some exons?

R transcripts biomaRt annotation hg19 • 1.9k views
ADD COMMENT
4
Entering edit mode
5.1 years ago
Emily 23k

The difference is the three exons:

  • ENSE00001146308, Exon 1, NA (BioMart) / 7590695 - 7590856 (GTF)
  • ENSE00002667911, Exon 2, 7579839 - 7579912 (BioMart) / 7579839 - 7579940 (GTF)
  • ENSE00003605891, Exon 11, 7572927 - 7573008 (BioMart) / 7571720 - 7573008 (GTF)

With BioMart, you're getting the coding regions of the exons, whereas you're just getting the coordinates of the whole exons from the GTF. If you look at the transcript you'll see that the 5' UTR spans all of exon 1 (so no coding region) and the first third of exon 2 (so the coding region is different to the region of the whole exon). Most of exon 11 is 3' UTR (again, different coordinates for the coding region compared to the whole exon).

ADD COMMENT
0
Entering edit mode

Thank you for the explanation! I believe, as I don't want to depend on external databases (I've been having a lot of trouble while looking for the coordinates of many genes via getBM due to ofter Ensembl mirrors shut-downs) I should use CDS + stop_codon mask from GTF file to get the same sets of coordinates as Ensembl provides.

ADD REPLY
1
Entering edit mode

You can also use the ensembldb package (https://bioconductor.org/packages/release/bioc/html/ensembldb.html) which will let you use a offline version of the Ensembl data (so no network issues) but in a database format you can construct all sorts of queries for.

ADD REPLY

Login before adding your answer.

Traffic: 2655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6