Biomart Annotation
10.2 years ago
int11ap1 ▴ 470

Good evening,

I have a vector (in R) of probes from an Affymetrix microarray. For each probe I would like to find the Ensembl ID, the gene name (HGNC symbol), the gene length and the GC content using the biomaRt package in R. To do this, I run:

library(biomaRt)
# Finding Ensembl IDs
data <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
ensemblids <- getBM(attributes=c("ensembl_gene_id"), filters=c("affy_hg_u133a"), values=probes, mart=data)
# Finding gene name (hgnc), gene length and GC-content
# getBM() returns a data frame, so pass the ID column itself as the filter values
dframe <- getBM(attributes=c("hgnc_symbol", "percentage_gc_content"), filters=c("ensembl_gene_id"), values=ensemblids$ensembl_gene_id, mart=data)

However, as you can see, I only obtain the gene name and the GC content, because I cannot find any attribute that gives the gene length. Do you know how to solve this? Another thing: my vector contains 22,000 probes, but ensemblids contains only 16,000 Ensembl IDs. Why is that?

Thanks in advance.

biomart annotation • 8.3k views
10.2 years ago
Emily 23k

Neil is right. There isn't a 1:1 mapping between Affy probes and Ensembl IDs. Several probes can map to the same gene, particularly if that gene is quite large. Depending on your chip, some probes may not map to genes at all. Another source of confusion may be the way that we handle probes in our database. We don't take the databases from Affy stating which probe goes with which gene. Instead we map the sequences of their probes to the genome and see where they overlap genes. This may also lead to us reporting different genes for a given probe than Affy does. There's a help page that explains this here.
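
One way to see this in your own data is to request the probe ID as an attribute alongside the gene ID, e.g. getBM(attributes=c("affy_hg_u133a", "ensembl_gene_id"), filters="affy_hg_u133a", values=probes, mart=data). This assumes affy_hg_u133a is available as an attribute as well as a filter; check with listAttributes(data). The sketch below uses a small made-up data frame in place of the query result to show how the probe and gene counts can diverge.

```r
# Toy stand-in for a getBM() result with one row per probe/gene pair.
# All probe and gene IDs here are invented for illustration.
probes <- c("p1", "p2", "p3", "p4", "p5")
mapping <- data.frame(
  affy_hg_u133a   = c("p1", "p2", "p2", "p3", "p4"),
  ensembl_gene_id = c("G1", "G1", "G2", "G3", "G3"),
  stringsAsFactors = FALSE
)

# Probes that returned no gene at all
unmapped <- setdiff(probes, mapping$affy_hg_u133a)

# Probes that hit more than one gene
genes_per_probe <- table(mapping$affy_hg_u133a)
multi_gene_probes <- names(genes_per_probe)[genes_per_probe > 1]

# Distinct genes reached; can be far fewer than the number of probes
n_unique_genes <- length(unique(mapping$ensembl_gene_id))
```

In this toy case five probes yield only three distinct genes: one probe is unmapped and two pairs of probes share a gene.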

Thank you, Emily!

10.2 years ago
Neilfws 49k

1) You can get a data frame containing all attributes like this:

attrs <- listAttributes(data)

Then grep for attributes whose name contains "length". CDS Length might be useful?

attrs[grep("length", attrs$name),]
#            name description
# 149  cds_length  CDS Length
# 1764 cds_length  CDS Length

2) The short, unsatisfying answer is that for various reasons, not every HGNC symbol maps directly to an Ensembl Gene ID. I'm sure Emily_Ensembl can tell you more about that.

Hi Neilfws, using cds_length (which is not actually what I am looking for: gene length != CDS length) I get an error:

Error in getBM(attributes = c("hgnc_symbol", "cds_length", "percentage_gc_content"), : Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed

There are different sections that you can get attributes from. To see how this is structured, have a look at the BioMart browser tool.

We don't actually have gene length as an attribute, but you can get the start and end coordinates, then just do some arithmetic. The start and end are in the same section as the other attributes you need, so you can get everything you need in a single query.
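
A minimal sketch of that arithmetic, assuming the coordinate attributes are named start_position and end_position (check with listAttributes(data)) and were added to the same getBM() call as hgnc_symbol and percentage_gc_content. The data frame below is made up and stands in for the query result.

```r
# Toy stand-in for a getBM() result; IDs, symbols and coordinates are invented.
genes <- data.frame(
  ensembl_gene_id       = c("G1", "G2"),
  hgnc_symbol           = c("AAA", "BBB"),
  percentage_gc_content = c(41.2, 55.0),
  start_position        = c(1000, 5000),
  end_position          = c(1999, 5499),
  stringsAsFactors = FALSE
)

# Coordinates are inclusive, so add 1 to the difference
genes$gene_length <- genes$end_position - genes$start_position + 1
```

Note that this gives the genomic span of the gene, introns included, which is not the same as a transcript or CDS length.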

Google that error; it's quite common and means that you're trying to query tables that are not linked. You'll need to do 2 separate queries, then merge the results.
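
A rough sketch of the merge step, assuming each of the two getBM() calls also requested ensembl_gene_id so there is a shared key. The two data frames below are made up and stand in for the query results; it is also an assumption here that cds_length comes back per transcript, so a gene may appear on several rows.

```r
# Toy stand-ins for the two query results; all values are invented.
features <- data.frame(
  ensembl_gene_id       = c("G1", "G2", "G3"),
  hgnc_symbol           = c("AAA", "BBB", "CCC"),
  percentage_gc_content = c(41.2, 55.0, 38.7),
  stringsAsFactors = FALSE
)
cds <- data.frame(
  ensembl_gene_id = c("G1", "G1", "G2"),
  cds_length      = c(300, 450, 900),
  stringsAsFactors = FALSE
)

# Left join on the gene ID: keep every gene from the first query,
# with NA where the second query returned nothing
merged <- merge(features, cds, by = "ensembl_gene_id", all.x = TRUE)
```

With all.x = TRUE, genes missing from the second result are kept rather than silently dropped, which makes gaps easier to spot.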
