R -- Get Exonic Ranges Of A Gene'S Transcripts Given Its Name?
4
6
Entering edit mode
11.9 years ago
Stew ▴ 60

I am writing an R script in which I have a list of gene names (along with the chromosome and the coordinate of a single point on the gene), and I need to find the list of transcripts for that gene, and the coordinates of the exonic and intronic regions of each of those transcripts (ie exonStarts etc in the UCSC knowngene table).

I'm sure there is a way to do this with Bioconductor, I've been messing around with BSGenome, VariantAnnotation, and the TxDb.Hsapiens.UCSC.hg19.knownGene packages but am making little progress and growing frustrated by the documentation.

Any ideas? Many thanks.

r bioconductor • 8.8k views
ADD COMMENT
1
Entering edit mode

Have a look at the biomaRt package, that should do.

ADD REPLY
5
Entering edit mode
11.9 years ago
Nathan Harmston ★ 1.1k

How about the GenomicFeatures library?

library(GenomicFeatures)
x= makeTranscriptDbFromUCSC("hg19", tablename="knownGene")
saveFeatures(x, file=paste("hg19", "_ucsc.sqlite", sep=""))
fn = loadFeatures("hg19_ucsc.sqlite")

then you can easily use the exons and transcripts methods to access the information you want if you have the gene identifiers......

genes = c(10772)
names(genes) = "gene_id"
transcripts(fn, vals = genes)

gives you the transcripts for that gene identifier and then you can take those, extract the transcript names and do a similar thing with the exons method.

HTH

ADD COMMENT
0
Entering edit mode

by the way when you say you have a list of gene names I am assuming you meant UCSC gene identifiers and not Entrez gene identifiers / gene names. Otherwise you'll have to convert them first.

ADD REPLY
0
Entering edit mode

For GenomicRanges_1.18.4, saveFeatures and loadFeatures, changed to saveDb and loadDb.

ADD REPLY
0
Entering edit mode

You don't need to use makeTranscriptDbFromUCSC. Load the Homo.sapiens package, and it will come with a TxDB object for hg19.

ADD REPLY
3
Entering edit mode
11.9 years ago
Paolo ▴ 320

For this kind of problems Biomart and the Bioconductor biomaRt package are your friends!

For example if you have "entrezgene" ids for your list of genes and you need the corresponding "ensemble transcript id" related to your genes you can do something like this:

library(biomaRt)
mart <- useMart("ensembl")
mart <- useDataset("hsapiens_gene_ensembl", mart)
att <- listAttributes(mart)
f <- listFilters(mart)
attributes <- c("entrezgene","ensembl_transcript_id")
filters <- "entrezgene"
values <- c(9751,64219,5591,642636,6097,83795) #randomly selected
output <- getBM(attributes=attributes, filters=filters, values=values, mart=mart)

Check all available filters and attributes reported in the att and f variables. For a nice introduction to the package see the associated vignette.

ADD COMMENT
0
Entering edit mode

can one use these package for refSeq gene definitions as well, or limited to ensembl definitions

ADD REPLY
2
Entering edit mode
11.9 years ago
Paolo ▴ 320

Another solution is to use the geneToTranscript function from the annmap package:

# c & p from the package Vignette
library(annmap)
gene = symbolToGene( c( 'PTEN', 'SHH' ) )
geneToTranscript( gene )
ADD COMMENT
1
Entering edit mode
7.6 years ago
Michi ▴ 990

Lets get the exons for the PLEC gene in HG19: http://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=full_report&list_uids=5339

library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
genome <- TxDb.Hsapiens.UCSC.hg19.knownGene

# the plec gene
plec_gene = genes(genome)[which(genes(genome)$gene_id == 5339),]

# get the exons with the gene coordinates
plec_exons = subsetByOverlaps(exons(genome), plec_gene)

# same for transcripts
plec_t = subsetByOverlaps(transcripts(genome), plec_gene)

That is about it

ADD COMMENT

Login before adding your answer.

Traffic: 2842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6