Question

Entrez IDs disappear when using Biomart with GRCh37 Genome version

5.0 years ago

luisa ▴ 10

Hi! I'm trying to get the location (chromosome and band) of a list of Entrez Gene IDs I got using the Homo.sapiens Bioconductor package:

indx <- findOverlaps(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), mycoords.gr)

Since my original data (mycoords.gr) are mapped to the GRCh37/hg19 genome version, I tried using BioMart to get the locations for that version of the genome:

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

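# findOverlaps() returns a Hits object, so take the gene IDs from the query (the genes) by hit index
my.symbols <- unique(genes(TxDb.Hsapiens.UCSC.hg19.knownGene)$gene_id[queryHits(indx)])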

my.regions <- getBM(c("entrezgene","hgnc_symbol", "chromosome_name", "band"),
                    filters = "entrezgene",
                    values = my.symbols,
                    mart = ensembl)

I noticed, however, that some of the Entrez IDs on my list were missing from my.regions. When I tried the current version of the genome, those IDs were present but others were missing...

Is there a difference in Entrez IDs between assemblies? I also tried retrieving all of the Entrez IDs in ensembl and some of them were also missing...

mapping <- getBM(attributes = c("entrezgene", "hgnc_symbol"), mart = ensembl)

I don't understand this... Is there an alternative to this method?

Thanks in advance!

genome assembly Biomart • 2.6k views

ADD COMMENT • link updated 5.0 years ago by Emily 23k • written 5.0 years ago by luisa ▴ 10

Can you give some examples of IDs that were in the wrong locations or missing, please?

ADD REPLY • link 5.0 years ago by Emily 23k

Yes, some of the missing ones were 100033416 (hg18) and 10002 (hg19) and 100033416 in both

ADD REPLY • link 5.0 years ago by luisa ▴ 10

I am able to detect the one that you have tagged as hg19:

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
getBM(mart=ensembl, attributes=c("hgnc_symbol", "entrezgene"), filters="entrezgene", values=c("100033416","10002"), uniqueRows=TRUE)
  hgnc_symbol entrezgene
1       NR2E3      10002

ADD REPLY • link 5.0 years ago by Kevin Blighe 87k

  hgnc_symbol entrezgene
1      SNHG14  100033416

This is what I get when I run the same code you wrote ...

ADD REPLY • link 5.0 years ago by luisa ▴ 10

We have a problem...

ADD REPLY • link 5.0 years ago by Kevin Blighe 87k

Did you mean hg18/NCBI36 or did you mean GRCh38?

ADD REPLY • link 5.0 years ago by Emily 23k

I think it is GRCh38

ADD REPLY • link 5.0 years ago by luisa ▴ 10

Someone will be along with a BioMart answer, but if you can post a few Entrez IDs we can see if an Entrez Direct solution is feasible.

ADD REPLY • link 5.0 years ago by GenoMax 141k

score 2 · Answer 1 · 2019-04-05

BioMart provides mappings from Ensembl genes to external references; it does not provide direct mappings between non-Ensembl identifiers. This means that when you look up NCBI -> HGNC mappings, you're actually looking up NCBI -> Ensembl -> HGNC mappings.
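
To make the two-hop lookup concrete, here is a minimal base R sketch. The two lookup tables are toy data built only from the IDs and symbols quoted in this thread (they are not a real BioMart export), and the names ncbi_to_ensembl, ensembl_to_hgnc and requested are made up for illustration. It shows how an NCBI ID with no Ensembl link drops out at the first hop, even though the Ensembl gene itself carries an HGNC symbol.

```r
# Toy lookup tables (illustrative only, not real BioMart output).
# Hop 1: NCBI (Entrez) gene ID -> Ensembl gene ID.
# 10002 is deliberately absent, mimicking the missing link described here.
ncbi_to_ensembl <- data.frame(
  entrezgene      = "100033416",
  ensembl_gene_id = "ENSG00000224078",
  stringsAsFactors = FALSE
)

# Hop 2: Ensembl gene ID -> HGNC symbol.
ensembl_to_hgnc <- data.frame(
  ensembl_gene_id = c("ENSG00000224078", "ENSG00000031544"),
  hgnc_symbol     = c("SNHG14", "NR2E3"),
  stringsAsFactors = FALSE
)

requested <- c("100033416", "10002")

# Chain the hops: filter on the NCBI ID, then join through the Ensembl ID.
hop1   <- ncbi_to_ensembl[ncbi_to_ensembl$entrezgene %in% requested, ]
mapped <- merge(hop1, ensembl_to_hgnc, by = "ensembl_gene_id")

# Requested IDs that never reached an HGNC symbol.
dropped <- setdiff(requested, mapped$entrezgene)

print(mapped)
print(dropped)
```

In this toy setup NR2E3 is present in the second table, but 10002 still comes back empty because nothing connects it to an Ensembl gene.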

NCBI 10002 does not map to the Ensembl gene ENSG00000031544 in GRCh37 because the two have different biotypes: in Ensembl the gene is non-coding. It's non-coding in Ensembl on GRCh37 because Ensembl annotation is based on the genome, so the gene sequences have to match the reference genome. NCBI do not have this constraint in their annotation. Because the GRCh37 assembly is flawed, there is no ORF and the gene could only be annotated in Ensembl as non-coding. If you compare the genomic region in GRCh37 to that in GRCh38, you'll see that a number of small contigs (the track with alternating shades of blue) have been introduced in GRCh38. These have fixed the underlying genome, so the gene is now listed as coding and NCBI 10002 is listed as an external reference. This is why we recommend always using the most up-to-date genome assembly.
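
One way to check this yourself is to query by the Ensembl gene ID instead of the NCBI ID and ask for the biotype alongside any NCBI cross-reference. The sketch below is only that, a sketch: describe_gene is a hypothetical helper name, it assumes biomaRt is installed, and it assumes the NCBI attribute is called "entrezgene_id" (it was "entrezgene" when this thread was written, so pass ncbi_attr explicitly if your mart still uses the old name).

```r
# Sketch: look a gene up by Ensembl ID on a chosen assembly and report
# its biotype plus any NCBI gene cross-reference.
# Assumptions: biomaRt is installed; the NCBI attribute name is
# "entrezgene_id" (older marts used "entrezgene").
describe_gene <- function(ensembl_id, grch = NULL,
                          ncbi_attr = "entrezgene_id") {
  # grch = 37 selects the GRCh37 archive; NULL uses the current assembly.
  mart <- biomaRt::useEnsembl(biomart = "ensembl",
                              dataset = "hsapiens_gene_ensembl",
                              GRCh    = grch)
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "gene_biotype",
                   "hgnc_symbol", ncbi_attr),
    filters    = "ensembl_gene_id",
    values     = ensembl_id,
    mart       = mart
  )
}

# Usage (needs network access, so left commented out):
# describe_gene("ENSG00000031544", grch = 37)  # GRCh37 archive
# describe_gene("ENSG00000031544")             # current assembly
```

Filtering on the Ensembl ID matters here: if the NCBI link is absent, filtering on the NCBI ID returns no rows at all, so you never get to see the biotype.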

100033416 does appear in GRCh37 but not in GRCh38. It seems to be one of a load of snoRNAs mapped to ENSG00000224078, all of which seem to be small RNAs overlapping a much larger one. None of these are present in GRCh38, which is an improvement, I think. Again, this is why we recommend using the most up-to-date data. It looks like the correct match should be to ENSG00000275529, so I'll feed that back to our developers.
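
Since an ID can be present on one assembly and absent on the other, it helps to list exactly which requested IDs a given query failed to return. Below is a small base R sketch of that check. missing_ids is a hypothetical helper, and res_a and res_b are toy data frames shaped like the two outputs quoted earlier in this thread, not fresh query results.

```r
# Return the requested IDs that are absent from a getBM()-style result.
missing_ids <- function(requested, result, id_col = "entrezgene") {
  # Compare as character, since the ID column may come back as integer.
  setdiff(as.character(requested), as.character(result[[id_col]]))
}

# Toy results shaped like the outputs quoted in this thread.
res_a <- data.frame(hgnc_symbol = "NR2E3",  entrezgene = 10002L)
res_b <- data.frame(hgnc_symbol = "SNHG14", entrezgene = 100033416L)

ids <- c("100033416", "10002")

print(missing_ids(ids, res_a))
print(missing_ids(ids, res_b))
```

Running the same check against the results from each mart gives a per-assembly list of IDs to investigate, rather than a vague sense that some have gone missing.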

Mapping between databases, especially for short sequences in repetitive regions, is quite a difficult problem. We are working with NCBI at the moment to improve our mapping with them, and hopefully this will improve in future.