How to extract the genomic positions and chromosome number for a list of genes
5.2 years ago
Star ▴ 60

Hi all,

I have a list of genes with Ensembl IDs like "ENSG00000272379.1", and I want to retrieve the chromosome number and the start and end positions of each gene on that chromosome. I tried the Ensembl BioMart web tool (http://asia.ensembl.org/biomart/martview/e96b1b88c4e0cbaf1b9d7442ed9f9b68), but it does not process all the genes in my text file. For example, I have 6000 genes and it returns results for only 1200 of them. My input file looks like this:

ENSG00000272379.1
ENSG00000175600.11
ENSG00000224017.1
ENSG00000112137.12

I don't know what I am doing wrong. Any advice would be appreciated. Thanks.

genome Ensembl Biomart • 11k views
ADD COMMENT
5.2 years ago
tiago211287 ★ 1.4k

The numbers after the dot are the gene version. It might be that you have very old gene versions.

You could use R to get this information after removing the Ensembl gene version, like this:

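# Install and load biomaRt (biocLite has been retired; BiocManager replaces it)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)

# Optional: browse the available marts, datasets, filters and attributes
biolist <- as.data.frame(listMarts())
ensembl <- useMart("ensembl")
ensemblist <- as.data.frame(listDatasets(ensembl))
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
filters <- listFilters(ensembl)
attributes <- listAttributes(ensembl)

# Download ID, versioned ID, chromosome, start and end for every human gene
t2g <- getBM(attributes = c('ensembl_gene_id', 'ensembl_gene_id_version', 'chromosome_name', 'start_position', 'end_position'), mart = ensembl)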

my_ids <- data.frame(ensembl_gene_id_version=c("ENSG00000272379.1","ENSG00000175600.11","ENSG00000224017.1", "ENSG00000112137.12"))
my_ids$ensembl_gene_id <- gsub("\\..*","", my_ids$ensembl_gene_id_version)

my_ids.version <- merge(my_ids, t2g, by= 'ensembl_gene_id')

my_ids.version
  ensembl_gene_id ensembl_gene_id_version.x ensembl_gene_id_version.y chromosome_name start_position end_position
1 ENSG00000112137        ENSG00000112137.12        ENSG00000112137.17               6       12716805     13290484
2 ENSG00000175600        ENSG00000175600.11        ENSG00000175600.15               7       40134977     40860763
3 ENSG00000224017         ENSG00000224017.1         ENSG00000224017.1               7       41101604     41133507
4 ENSG00000272379         ENSG00000272379.1         ENSG00000272379.1               6       13290018     13290490
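
Downloading every gene and merging works, but the same lookup can probably be done by passing the unversioned IDs to getBM() as a filter, so only the requested genes are fetched. This is a minimal sketch, assuming the `ensembl` mart object created above; `strip_version` and `fetch_positions` are hypothetical helper names, not part of biomaRt.

```r
# Hypothetical helper: drop the ".N" version suffix from Ensembl gene IDs.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Hypothetical helper: query only the requested genes instead of the whole table.
# `mart` is assumed to be the object returned by useDataset()/useMart() above.
fetch_positions <- function(versioned_ids, mart) {
  ids <- unique(strip_version(versioned_ids))
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "chromosome_name",
                   "start_position", "end_position"),
    filters = "ensembl_gene_id",
    values = ids,
    mart = mart
  )
}

# Usage (needs an internet connection):
# positions <- fetch_positions(c("ENSG00000272379.1", "ENSG00000175600.11"), ensembl)
```

Genes that are absent from the current release are simply left out of the result, so it is worth comparing the returned IDs against the input.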
ADD COMMENT

Thank you for the code, tiago211287. However, the final data frame (my_ids.version) comes back empty. I replaced the line

my_ids <- data.frame(ensembl_gene_id_version=c("ENSG00000272379.1","ENSG00000175600.11","ENSG00000224017.1","ENSG00000112137.12"))

with

test <- read.table("MetaXcanOutput-BiomartInput.txt")
my_ids <- data.frame(ensembl_gene_id_version=c(test$v1))

where MetaXcanOutput-BiomartInput.txt is the file containing almost 6000 gene IDs along with their version numbers. Moreover, the data frame "my_ids" contains two identical columns, as follows.

ensembl_gene_id_version      ensembl_gene_id
          6544                6544
          4060                4060
          5340                5340

I am pretty new to data science and R, but as I understand it, the last line, my_ids.version <- merge(my_ids, t2g, by= 'ensembl_gene_id'), cannot find any matching ensembl_gene_id values, which is why my_ids.version is empty. Am I right? Can you suggest what to try next?

ADD REPLY

You need one column called ensembl_gene_id_version that holds your original versioned IDs, and another column, ensembl_gene_id, that holds the same IDs without versions, created with gsub:

my_ids$ensembl_gene_id <- gsub("\\..*","", my_ids$ensembl_gene_id_version)

Only then can you merge, using:

my_ids.version <- merge(my_ids, t2g, by= 'ensembl_gene_id')
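
The numbers shown in my_ids look like what you would get if the IDs had been read as a factor: in R versions before 4.0, read.table() converts strings to factors by default, and wrapping a factor in c() then gives its integer codes. Note also that read.table() names the first column V1 (capital V). Here is a sketch of reading the file as plain text, assuming one ID per line and no header; the function name `read_gene_ids` is made up for illustration.

```r
# Hypothetical helper: read versioned Ensembl IDs as character strings and
# add a second column with the version suffix removed.
read_gene_ids <- function(path) {
  tab <- read.table(path, header = FALSE, stringsAsFactors = FALSE)
  ids <- data.frame(ensembl_gene_id_version = tab$V1, stringsAsFactors = FALSE)
  ids$ensembl_gene_id <- gsub("\\..*", "", ids$ensembl_gene_id_version)
  ids
}

# Small self-check with a temporary file standing in for the real input
tmp <- tempfile(fileext = ".txt")
writeLines(c("ENSG00000272379.1", "ENSG00000175600.11"), tmp)
demo_ids <- read_gene_ids(tmp)
print(demo_ids)
```

With the real data, my_ids <- read_gene_ids("MetaXcanOutput-BiomartInput.txt") should give the two columns that the merge expects.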
ADD REPLY

It's done! Thank you so much. But there is still a problem: my input file has 6605 gene IDs, yet I get results for only 6414 of them (I tried the BioMart online tool as well, with the same outcome). 191 genes are missing, which suggests that those genes are not present in the database. What could be a possible solution? How can I get the information for the remaining genes?

ADD REPLY

Can you please share a sample of the remaining IDs?
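
One way to pull out the IDs that did not come back is setdiff() on the unversioned ID columns. The sketch below uses small toy data frames in place of the real my_ids and my_ids.version objects, so the values are illustrative only.

```r
# Toy stand-ins for the objects in this thread (values are illustrative only)
my_ids <- data.frame(
  ensembl_gene_id = c("ENSG00000272379", "ENSG00000264469", "ENSG00000112137"),
  stringsAsFactors = FALSE
)
my_ids.version <- data.frame(
  ensembl_gene_id = c("ENSG00000272379", "ENSG00000112137"),
  chromosome_name = c("6", "6"),
  stringsAsFactors = FALSE
)

# IDs that were submitted but did not come back from the merge
missing_ids <- setdiff(my_ids$ensembl_gene_id, my_ids.version$ensembl_gene_id)
print(missing_ids)

# Optionally save them for a separate lookup (the file name is arbitrary)
# writeLines(missing_ids, "missing_ids.txt")
```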

ADD REPLY

Hi, some of the genes missing from the results are listed below. When I searched for these genes in the GRCh37 build, they mapped to the respective genes, but no results were available for GRCh38.p12.

ENSG00000264469.1
ENSG00000179837.6
ENSG00000272216.1
ENSG00000225490.1
ENSG00000186275.7
ADD REPLY

Is there a way to query GRCh37 using biomaRt in R?

ADD REPLY

You can try to convert your retired IDs using the ID Conversion tool for GRCh37. Or maybe this Biostar link can help you to access the GRCh37 using biomaRt.

This Ensembl page contains information about converting between both assemblies.

ADD REPLY

Hello aammarah.632

All of the above suggestions are great - BioMart will work with versioned IDs but you need to select the correct format, and if they are not existing in the current database (regardless of the version) then no results will be found.

I took a look at a couple of your IDs and can see that they are not in the dedicated GRCh37 site, which is the database we continue to update with new data. However, I could find them in the archive site for GRCh37 from 2014, which is not updated and so remains a snapshot of the data from 2014. This suggests to me that these genes are no longer in the current database, probably because the annotation has been reviewed and they were found to no longer be correct as new data (e.g. cDNA, protein, or EST) has become available.

You could pass your list of lost IDs through the archive's BioMart either on the website or through the R package - you can see how to do the latter here - you need to specify release 75: How To Use Archived Version Of Ensembl In Biomart. If you want to you can extract the coordinates and map them to GRCh38 using our Assembly converter.
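
For the R route, a rough sketch of querying the release 75 archive is below. It assumes the February 2014 archive host serves release 75 and that its mart is named ENSEMBL_MART_ENSEMBL; both should be checked with biomaRt's listEnsemblArchives() and listMarts() before relying on them. `connect_release75` and `lookup_positions` are hypothetical helper names.

```r
# Hypothetical helper: connect to the Ensembl release 75 archive (GRCh37).
# The host and mart names are assumptions; confirm them with
# biomaRt::listEnsemblArchives() and biomaRt::listMarts(host = ...).
connect_release75 <- function() {
  biomaRt::useMart(
    biomart = "ENSEMBL_MART_ENSEMBL",
    dataset = "hsapiens_gene_ensembl",
    host = "https://feb2014.archive.ensembl.org"
  )
}

# Hypothetical helper: look up coordinates for unversioned IDs on a given mart.
lookup_positions <- function(ids, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "chromosome_name",
                   "start_position", "end_position"),
    filters = "ensembl_gene_id",
    values = ids,
    mart = mart
  )
}

# Usage (requires network access; IDs taken from the missing genes listed above):
# archive <- connect_release75()
# lookup_positions(c("ENSG00000264469", "ENSG00000179837"), archive)
```

The coordinates returned this way are on GRCh37, so they would still need converting if GRCh38 positions are required.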

ADD REPLY

Hmm, right. Thank you. I have extracted the positions of all the genes from GRCh37. Thanks for the help, tiago211287 and Erin_Ensembl.

ADD REPLY
5.2 years ago

Hello,

you are using IDs with version numbers. Some of those versions are no longer available in the current release. What you can do is:

  1. Filter by Gene stable ID(s) without the version number (cut -d"." -f1 ensg.txt > ensg_noversion.txt can be used to create a file without version numbers).
  2. Go to grch37.ensembl.org and try there. But be aware that the positions you get will be based on GRCh37/hg19 and not GRCh38/hg38.

fin swimmer

ADD COMMENT

Thank you for the reply! I did it using R as well as the BioMart interface. I had a total of 6605 genes, but I get the required information for only 6414 of them. The data for 191 genes are missing (using GRCh38.p12). Is there any way to get the required information for all the genes?

ADD REPLY

Could you please provide some of the IDs that are missing?

ADD REPLY

Hi, some of the genes missing from the results are listed below. When I searched for these genes in the GRCh37 build, they mapped to the respective genes, but no results were available for GRCh38.p12.

ENSG00000264469.1
ENSG00000179837.6
ENSG00000272216.1
ENSG00000225490.1
ENSG00000186275.7
ADD REPLY
