Get gene length with R
1
1
Entering edit mode
5.9 years ago
user31888 ▴ 130

How to get the gene length from a list of Ensembl IDs using biomaRt for instance (or any R-based method without having to download a separate annotation file first)?

Since gene_length attribute does not exist in biomaRt, is there a better alternative than using start_position and end_position attributes, then substracting the 2 values like as follows:

library(biomaRt)
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
start_pos = getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_idl", values = ensembl_list , mart = human, attributesL = "start_position", martL = human, uniqueRows=T)
end_pos = getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_idl", values = ensembl_list , mart = human, attributesL = "end_position", martL = human, uniqueRows=T)
gene_L <- merge(start_pos, end_pos, by.x="Gene.stable.ID", by.y="Gene.stable.ID")
gene_L$Length <- gene_L$Gene.end..bp. - gene_L$Gene.start..bp.
R • 23k views
ADD COMMENT
0
Entering edit mode

end_position - start_position is potentially wrong due to splicing. Introns should probably not get counted, although you did not explain your application.

ADD REPLY
0
Entering edit mode

You are right. I have a numeric gene expression matrix (in CPM) that I want to convert into FPKM. That's why I was looking for a way to get gene length. So do you think taking introns into account would matter here?

ADD REPLY
0
Entering edit mode

Absolutely.

ADD REPLY
0
Entering edit mode

The EDASeq getGeneLengthAndGCContent indeed takes exons (see line 109 of the code here). Because the CPM matrix was generated with HTSeq-count, I think I should use EDASeq and skip the intron, no? Just to fit with the same method.

ADD REPLY
0
Entering edit mode

My understanding of gene is represented by this pic: https://upload.wikimedia.org/wikipedia/commons/5/54/Gene_structure_eukaryote_2_annotated.svg. esp DNA part. I guess OP requirement is total length of exons (all possible exons). This is different from gene length (IMO). Gene length from NCBI/Ensembl atleast cover all known transcripts (for gene). Gene length calculated by EDAseq doesn't make sense to me esp calling it gene length. So please take whatever is suitable for analysis. Code is provided for either case.

ADD REPLY
0
Entering edit mode

EDASeq is an overkill for this task, (if someone is interested only in gene lengths), here are a few alternatives.

https://bioinformatics.stackexchange.com/questions/4942/finding-gene-length-using-ensembl-id

Especially, the answer regarding the GenomicFeatures library.

ADD REPLY
5
Entering edit mode
5.9 years ago
library (EDASeq)
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
getGeneLengthAndGCContent(ensembl_list, "hsa")

Connecting to BioMart ...
Downloading sequences ...
                length        gc
ENSG00000000003   4535 0.3995590
ENSG00000000419   1207 0.4159072
ENSG00000000457   6883 0.4117391
ENSG00000000460   5967 0.4298643

@OP: I think code is unnecessarily complex (IMO) for the logic in OP and it can be shortened further by:

> library(biomaRt)
> ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
> human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
> gene_coords=getBM(attributes=c("hgnc_symbol","ensembl_gene_id", "start_position","end_position"), filters="ensembl_gene_id", values=ensembl_list, mart=human)
> gene_coords$size=gene_coords$end_position - gene_coords$start_position
> gene_coords

  hgnc_symbol ensembl_gene_id start_position end_position   size
1      TSPAN6 ENSG00000000003      100627109    100639991  12882
2        DPM1 ENSG00000000419       50934867     50958555  23688
3       SCYL3 ENSG00000000457      169849631    169894267  44636
4    C1orf112 ENSG00000000460      169662007    169854080 192073

However these number are not matching with those from EDASeq. I checked the coordinates of gene TSPAN6 ( ENSG00000000003) on NCBI. In NCBI gene (7105) start and stop respectively are: 100627108 and 100637104. Start coordinate on NCBI is off by 1 base from biomart results and stop coordinate is farther away (on chromosome x) compared to end position on NCBI. When viewed (the same gene) in IGV (hg38), gene start and stop positions are: chrX:100,625,108-100,639,104. In general, for a given gene, IGV display coordinates are 2kb up and down stream of gene start and end. Thus hg38 coordinates (from IGV) and GRCh 38 match (quick way). EDAseq probably is doing sum of exon lengths and calculating GC content based on that.

ADD COMMENT
0
Entering edit mode

Since I am looking for a way to convert CPM to FPKM (the gene length would be scaled to 10^6), few bases difference between EDASeq and NCBI/IGV may not be a big change.

ADD REPLY

Login before adding your answer.

Traffic: 3065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6