Mapping From Gene Names To Gene Lengths In R?
4
2
Entering edit mode
11.2 years ago
John St. John ★ 1.2k

I have a list of genes, and their ENTREZ and HUGO Ids, along with raw counts of RNA-seq reads mapping to those genes. I want to come up with a quick approximation of either RPKM or TPM.

To do that all I am missing is the length of each gene. Is there a quick way to get at this information in R? Maybe something like mapping ENTREZ to RefSeq ID, and then grabbing the length of whatever the first RefSeq transcript happens to be for each gene?

Thanks!

r gene length entrez • 11k views
ADD COMMENT
9
Entering edit mode
11.2 years ago
lefebvrf ▴ 100

Quick and dirty total exonic lengths, leveraging GenomeFeatures and the Hs org package:

  hg19GeneLengths <- function(symbols)
    {
        require(TxDb.Hsapiens.UCSC.hg19.knownGene) 
        require(org.Hs.eg.db)
        exons.db = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by='gene')    
        egs    = unlist(  mget(symbols[ symbols %in% keys(org.Hs.egSYMBOL2EG) ],org.Hs.egSYMBOL2EG) )
        sapply(egs,function(eg)
        {
            exons = exons.db[[eg]]
            exons = reduce(exons)
            sum( width(exons) )
        })
    }
    hg19GeneLengths( c('STAT1','CXCL10','ACTB','PDCD1') )


 STAT1 CXCL10   ACTB  PDCD1 
  4943   1216   2253   2112
ADD COMMENT
2
Entering edit mode
11.2 years ago
David W 4.9k

John,

You could consider rentrez, which is a pretty low-level R library for the entrez API (which I wrote!).

In this case you would want to find links to other databases from your gene IDs, and then get the summary file for the refseq record:

library(rentrez)

links <- entrez_link(db = "all", ids = 353174, dbfrom = "gene")
names(links)


##  [1] "gene_bioproject"            "gene_biosystems_all"       
##  [3] "gene_cdd"                   "gene_dbvar"                
##  [5] "gene_genome"                "gene_geoprofiles"          
##  [7] "gene_gtr"                   "gene_homologene"           
##  [9] "gene_nuccore"               "gene_nuccore_mgc"          
## [11] "gene_nuccore_pos"           "gene_nuccore_refseqrna"    
## [13] "gene_nucest"                "gene_nucest_clust"         
## [15] "gene_nucleotide"            "gene_nucleotide_clust"     
## [17] "gene_nucleotide_mgc"        "gene_nucleotide_mgc_url"   
## [19] "gene_nucleotide_pos"        "gene_omim"                 
## [21] "gene_pcassay"               "gene_pcassay_proteintarget"
## [23] "gene_pccompound"            "gene_pcsubstance"          
## [25] "gene_pmc"                   "gene_probe"                
## [27] "gene_protein"               "gene_protein_refseq"       
## [29] "gene_pubmed"                "gene_pubmed_citedinomim"   
## [31] "gene_pubmed_rif"            "gene_snp"                  
## [33] "gene_snp_genegenotype"      "gene_taxonomy"             
## [35] "gene_unigene"               "gene_unists"               
## [37] "file"

Once you have ID for the refseq record, you can fetch an (unparsed) XML file with entrez_summary:

e_sum <- entrez_summary(db = "nuccore", id = links$gene_nuccore_refseqrna)
print(e_sum)

## 
## http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
## <eSummaryResult>
##   <DocSum>
##     <Id>206725455</Id>
##     <Item Name="Caption" Type="String">NM_180990</Item>
##     <Item Name="Title" Type="String">Homo sapiens zinc activated ligand-gated ion channel (ZACN), mRNA</Item>
##     <Item Name="Extra" Type="String">gi|206725455|ref|NM_180990.3|[206725455]</Item>
##     <Item Name="Gi" Type="Integer">206725455</Item>
##     <Item Name="CreateDate" Type="String">2003/05/15</Item>
##     <Item Name="UpdateDate" Type="String">2012/11/11</Item>
##     <Item Name="Flags" Type="Integer">512</Item>
##     <Item Name="TaxId" Type="Integer">9606</Item>
##     <Item Name="Length" Type="Integer">1473</Item>
##     <Item Name="Status" Type="String">live</Item>
##     <Item Name="ReplacedBy" Type="String"/>
##     <Item Name="Comment" Type="String"></Item>
##   </DocSum>
## </eSummaryResult>
##

Presmusing this is the process you would go through, then wrapping it up in a function (and using Xpath to extract information from the XML) is straightforward:

get_entrez_length <- function(e_id) {
    links <- entrez_link(db = "nuccore", ids = e_id, dbfrom = "gene")
    summary_xml <- entrez_summary(db = "nuccore", id = links$gene_nuccore_refseqrna)
    len <- xpathSApply(summary_xml, "//Item[@Name='Length']", xmlValue)
    return(as.integer(len))
}

get_entrez_length(353174)

## [1] 1473

If you do use rentrez and have any questions/suggestions/bugs let me know - still developing it so happy to make it more useful.

ADD COMMENT
0
Entering edit mode
11.2 years ago
ed.liaw ▴ 90

Hey John :D

I don't have an exact solution for you, but maybe this information would help (wouldn't necessarily be quick though):

It looks like someone was coding up an R interface to eutils: http://code.google.com/p/reutils/source/browse/branches/lite/eutils.r?r=24 . I looks to me like esearch and esummary are functional enough to take queries into an XML tree. Might be more of a hassle than its worth, though.

ADD COMMENT
0
Entering edit mode
11.2 years ago
Irsan ★ 7.8k

biomaRt?

ADD COMMENT

Login before adding your answer.

Traffic: 2721 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6