Mapping Ensembl Gene IDs with dot suffix
14 months ago
mk • 90

I have a bunch of bulk mRNA sequencing data pulled from TCGA. The feature names appear to be Ensembl gene IDs with a suffix. Here is an example:

[995] "ENSG00000236246.1" "ENSG00000281088.1"
[997] "ENSG00000254526.1" "ENSG00000223575.2"
[999] "ENSG00000201444.1" "ENSG00000232573.1"

I am taking the intersection between these features and a set of Entrez Gene IDs. To do this, I am using the biomaRt package to generate a mapping between Ensembl gene IDs and Entrez gene IDs. However, the only Ensembl gene IDs I can find lack the suffixes. Here is the head of the table that maps Entrez genes to Ensembl genes:

  entrezgene ensembl_gene_id
1      90529 ENSG00000001460
2       9235 ENSG00000008517
3      10747 ENSG00000009724
4     654364 ENSG00000011052
5     112611 ENSG00000013392
6      57210 ENSG00000022567

Can someone explain what the Ensembl suffixes mean and how to convert these names to Entrez? If this can be done with biomaRt, it would be ideal. Thanks.

14 months ago
EMBL-EBI

The numbers are version numbers. There is information about stable ID versioning here. You can just strip off the version numbers to use with biomaRt.
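
A minimal base-R sketch of stripping the version, assuming the IDs are held in a character vector (the name `ids` is only for illustration):

```r
# Two of the versioned IDs from the question
ids <- c("ENSG00000236246.1", "ENSG00000223575.2")

# Remove a trailing ".<digits>"; the dot is escaped so it matches a literal "."
ids_stripped <- sub("\\.[0-9]+$", "", ids)
print(ids_stripped)
```

IDs that carry no version suffix pass through unchanged, because the pattern requires a literal dot.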

13 months ago
Mike Smith ♦ 1.2k
EMBL Heidelberg / de.NBI

Here's an example of doing the conversion using biomaRt. You can use the versioned IDs you've got, but you'll see it's better to remove the version numbers.

First, we'll load biomaRt and use your example IDs.

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

gene_ids_version <- c("ENSG00000236246.1",
                      "ENSG00000281088.1",
                      "ENSG00000254526.1",
                      "ENSG00000223575.2",
                      "ENSG00000201444.1",
                      "ENSG00000232573.1")

Now we can query BioMart, specifying that we want to use the versioned Ensembl Gene IDs by using the following:

getBM(attributes = c('ensembl_gene_id_version',
                     'entrezgene'),
      filters = 'ensembl_gene_id_version', 
      values = gene_ids_version,
      mart = mart)

> 
  ensembl_gene_id_version entrezgene
1       ENSG00000201444.1         NA
2       ENSG00000223575.2         NA
3       ENSG00000232573.1         NA
4       ENSG00000254526.1         NA

However, notice that we only get 4 results returned from our 6 IDs. This is because if you query using a version number, but it isn't the most recent version, it doesn't return a result - not really ideal. Better to do as Emily suggests, and strip the version number to use just the Ensembl gene ID. We'll use the stringr package to do that here:

library(stringr)
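# Escape the dot so it matches a literal "." rather than any character;
# unescaped, the pattern would also eat into IDs that have no version suffix
gene_ids <- str_replace(gene_ids_version,
                        pattern = "\\.[0-9]+$",
                        replacement = "")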

Now rerun the query with the trimmed IDs and you'll get 5 results this time:

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = gene_ids,
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000201444         NA
2 ENSG00000223575         NA
3 ENSG00000232573         NA
4 ENSG00000236246         NA
5 ENSG00000254526         NA

The completely missing entry is because that gene, ENSG00000281088, has been retired from Ensembl, so you'll never get a result. The NA values for the rest are because there's no mapping between Ensembl and Entrez for those genes.
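
To tell these two cases apart programmatically, something like the following may help. It is only a sketch: `res` is a hand-typed stand-in for the getBM() result shown above, so it runs without querying BioMart, and the names `missing_ids` and `unmapped_ids` are made up for illustration.

```r
# Stripped IDs that were sent to BioMart (from the example above)
gene_ids <- c("ENSG00000236246", "ENSG00000281088", "ENSG00000254526",
              "ENSG00000223575", "ENSG00000201444", "ENSG00000232573")

# Stand-in for the getBM() result shown above, typed in by hand
res <- data.frame(
  ensembl_gene_id = c("ENSG00000201444", "ENSG00000223575", "ENSG00000232573",
                      "ENSG00000236246", "ENSG00000254526"),
  entrezgene = NA_integer_,
  stringsAsFactors = FALSE
)

# IDs that returned no row at all (e.g. retired genes)
missing_ids <- setdiff(gene_ids, res$ensembl_gene_id)

# IDs that returned a row but have no Entrez mapping
unmapped_ids <- res$ensembl_gene_id[is.na(res$entrezgene)]

print(missing_ids)
print(unmapped_ids)
```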

Just to check it's really working we'll demonstrate with some IDs that can be mapped.

getBM(attributes = c('ensembl_gene_id',
                     'entrezgene'),
      filters = 'ensembl_gene_id', 
      values = c('ENSG00000001460', 'ENSG00000008517', 'ENSG00000009724'),
      mart = mart)

>
  ensembl_gene_id entrezgene
1 ENSG00000001460      90529
2 ENSG00000008517       9235
3 ENSG00000009724      10747
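
Since the original goal was to intersect the versioned feature names with a set of Entrez IDs, here is a rough sketch of carrying the mapping back to the versioned names. Assumptions are flagged in the comments: the version suffixes on the first two IDs and the `entrez_set` vector are invented for illustration, and `mapping` is a hand-typed stand-in for a getBM() result.

```r
# Feature names as they might appear in the expression data.
# NOTE: the ".5" and ".10" suffixes are invented for this illustration.
features <- c("ENSG00000001460.5", "ENSG00000008517.10", "ENSG00000236246.1")

# Lookup table linking each versioned name to its unversioned ID
lookup <- data.frame(
  feature = features,
  ensembl_gene_id = sub("\\.[0-9]+$", "", features),
  stringsAsFactors = FALSE
)

# Stand-in for a getBM() result (values copied from the output above)
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000001460", "ENSG00000008517", "ENSG00000009724"),
  entrezgene = c(90529L, 9235L, 10747L),
  stringsAsFactors = FALSE
)

# Left join: every feature is kept, unmapped ones get NA for entrezgene
annotated <- merge(lookup, mapping, by = "ensembl_gene_id", all.x = TRUE)

# Keep the features whose Entrez ID is in a (hypothetical) set of interest
entrez_set <- c(90529L, 10747L)
kept <- annotated$feature[annotated$entrezgene %in% entrez_set]
print(kept)
```

If a newer Ensembl release rejects 'entrezgene' as an attribute, it has probably been renamed ('entrezgene_id' in recent releases, as far as I know); listAttributes(mart) shows what the current mart offers.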
14 months ago
Limoges, CBRS, France

Something like this? In the R console:

data <- c("ENSG00000236246.1","ENSG00000281088.1","ENSG00000254526.1","ENSG00000223575.2","ENSG00000201444.1","ENSG00000232573.1")
data_modified <- sapply(strsplit(data,"\\."), function(x) x[1])
2.2 years ago
PavolG • 0
Bethesda/NIH

My favorite way to strip the versions. It uses nth() from dplyr and tstrsplit() from data.table.

dplyr::nth(data.table::tstrsplit(gene_ids_version, split = "\\."), n = 1)