Biostar Beta. Not for public use.
How to add biological annotations using R
1
Entering edit mode
14 months ago
neranjan • 20
US

Hi All,

Is there a R package that I can use to get the biological annotations?

I am using NOISeq package to analyze RNASeq data, and I am interested in adding additional biological annotations such as:

  • feature length
  • GC content
  • biological classification features
  • chromosome position

Question: Is it possible to direct me to how to get these information ? Can you let me know which packages that I can use to get the information.

Thank you very much.

ADD COMMENTlink
0
Entering edit mode

Thank you I was able to use biomaRt package to grab the required sequences using:

g_ids=c("ENSG00000177757", "ENSG00000187634", "ENSG00000188976")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene_coords=getBM(attributes=c("ensembl_gene_id", "start_position","end_position","percentage_gene_gc_content"), filters="ensembl_gene_id", values=g_ids, mart=human)
gene_coords$length=gene_coords$end_position - gene_coords$start_position

Which gave me:

  ensembl_gene_id start_position end_position percentage_gene_gc_content length
1 ENSG00000177757         817371       819837                      48.52   2466
2 ENSG00000187634         923928       944581                      66.58  20653
3 ENSG00000188976         944204       959309                      59.55  15105

But the gene start and end positions as well as length of gene is well away from what it should be as shown below: (where i am following the NOISeq tutorial)

                Length    GC        Biotype Chromosome GeneStart GeneEnd
ENSG00000177757 2464.0 48.58        lincRNA          1    742614  745077
ENSG00000187634 4985.0 65.99 protein_coding          1    850393  869824
ENSG00000188976 3870.5 59.55 protein_coding          1    869459  884494
ADD REPLYlink
0
Entering edit mode

You are using GRCh38, NOISeq tutorial GRCh37?

ADD REPLYlink
0
Entering edit mode

it is possible, but in NOISeq it does not say which it is been using.

But again, the difference is huge, so it is obvious that I am missing something in calculating the gene-length.

I also tried to calculate the mean, median and sum of transcript length, and which was way off than the number they provided in NOISeq

  transcript_length.median transcript_length.mean transcript_length.sum
1                 1947.000               1947.000              1947.000
2                 2159.000               1699.353             28889.000
3                 1244.500               1749.000             10494.000
ADD REPLYlink
1
Entering edit mode

I think your tutorial is very old. Add "version=75" to your useMart command, and you'll see gene coordinates close to the tutorials.

ADD REPLYlink
1
Entering edit mode

As a side note, you calculate the total length of the gene (exons and introns). I guess NOISeq is using only sum of exon length.

ADD REPLYlink
0
Entering edit mode

Thanks @swbarnes2 and @b.nota

I was able to replicate the NOISeq data length using a older version of mart, and also using transcript start and end locations to calculate the length.

So my final code would be:

en_2=useMart("ensembl", host = "http://feb2014.archive.ensembl.org")
human_2 = useDataset("hsapiens_gene_ensembl",mart=en_2)

g_ids_t=c("ENSG00000177757", "ENSG00000187634", "ENSG00000188976")
gene_coords=getBM(attributes=c("ensembl_gene_id", "start_position","end_position", "transcript_start", "transcript_end"), filters="ensembl_gene_id", values=g_ids_t, mart=human_2)
gene_coords$t_length = abs(gene_coords$transcript_start - gene_coords$transcript_end)

t_median_2 <- aggregate( t_length ~ ensembl_gene_id, gene_coords, plyr::each(median))

which gave me:

  ensembl_gene_id t_length
1 ENSG00000177757   2463.0
2 ENSG00000187634   4984.0
3 ENSG00000188976   3869.5

So I can take this as an answer.

As a side note: Which is off-set by 1 in each case, to the NOISeq data set: ( why ?)

                Length    GC        
ENSG00000177757 2464.0 
ENSG00000187634 4985.0 
ENSG00000188976 3870.5
ADD REPLYlink
2
Entering edit mode

The off-set of one is probably because of the one-based coordinate system used by ensembl. Add + 1 to your equation.

ADD REPLYlink
0
Entering edit mode

What are the three most common sources of programming errors? Poor naming, and 1-off errors.

ADD REPLYlink
2
Entering edit mode
11 months ago
Benn 6.9k
Netherlands

I think biomaRt can get you far, please read the manual.

ADD COMMENTlink
1
Entering edit mode
13 months ago
Ashastry • 60

BiomaRt is my first option. Although, I occasionally use AnnotationDBI package.

ADD COMMENTlink
0
Entering edit mode
14 months ago
neranjan • 20
US
en_2=useMart("ensembl", host = "http://feb2014.archive.ensembl.org")
human_2 = useDataset("hsapiens_gene_ensembl",mart=en_2)

So as a last question to close this , is there a way to save this "mart" object for later use. (or as a reference for later analysis) ?

ADD COMMENTlink
0
Entering edit mode

mart<-useMart(biomart ="ensemble", dataset = "")

ADD REPLYlink
0
Entering edit mode

not the assignment operator !!!

What I meant is to save it as object for later use, so I can load it when I do need it. (since these versions change what they have with time)

So is there a way to save this object in my hard drive? (some sort of similar way as they do in TxDb objects)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(samplefile)
ADD REPLYlink
1
Entering edit mode

Save the R session:

save.image("Analysis.Rdata")

load it again another time:

load("Analysis.Rdata")
ADD REPLYlink
0
Entering edit mode

Thanks, I think i was looking at more of only saving the ensemble database, and I think it can be done using:

rat_ensembl = useDataset("rnorvegicus_gene_ensembl",mart=ensembl)
save(rat_ensembl, file = "rat_ensembl.RData")

Then loading using:

load("rat_ensembl.RData")
ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1