How to query sequences from reference genome with R?
1
0
Entering edit mode
8.5 years ago
amir.jebelli ▴ 50

I need to query hg19 reference genome and retrieve several 50 nucleotides bases from various positions (exonic, intronic, etc). I found biomaRt in bioconductor package and for testing to check it works properly I use the following command but it returns empty result :

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
result = getSequence(chromosome=2, start=86374868, end=86374877,type="entrezgene",seqType="gene_exon_intron", mart=ensembl)

I expect to see "acatctcgca" as a result. I think the problem is come from miss configuration in function parameters. Am I using the right tool for this problem ?

I will appreciate if you introduce other tools for querying reference genome through web service.

R biomaRt • 4.9k views
ADD COMMENT
3
Entering edit mode
8.5 years ago

That's probably because biomaRt defaults to the latest assembly, which is GRCh38 in respect to Human. Try the following when creating your biomaRt object.

ensembl    <- useMart("ensembl",
                      dataset = "hsapiens_gene_ensembl",
                      host    = "grch37.ensembl.org",
                      path    = "/biomart/martservice")​
ADD COMMENT
0
Entering edit mode

Hi, Thanks for your response, but I get the error "Incorrect biomart name"

I thinks the problem is not related to version of reference genome, In that case I should get a sequence but always I receive empty result.

> ensembl    <- useMart("ensembl",
+                       dataset = "hsapiens_gene_ensembl",
+                       host    = "grch37.ensembl.org",
+                       path    = "/biomart/martservice")
Error in useMart("ensembl", dataset = "hsapiens_gene_ensembl", host = "grch37.ensembl.org",  :
  Incorrect BioMart name, use the listMarts function to see which BioMart databases are available
ADD REPLY
1
Entering edit mode
ensembl = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org",dataset="hsapiens_gene_ensembl")
ADD REPLY
0
Entering edit mode

Thanks Emily, That resolved the BioMart name problem but when I use the following command to fetch 10 bases it fetch very long sequence, Do you have any idea?

getSequence(chromosome=2, start=86374868, end=86374877,type="entrezgene",seqType="gene_exon_intron", mart=ensembl)
ADD REPLY
1
Entering edit mode

That's not getting you ten bases, that's getting you the gene sequences of every transcript that overlaps that ten bases. BioMart is a gene-centric tool. It cannot be used to get genome sequences.

The REST API is a quick easy way to get sequences of regions.

ADD REPLY
0
Entering edit mode

The REST API is exactly what I was looking for,

Thanks a lot for helping Emily.

ADD REPLY

Login before adding your answer.

Traffic: 3059 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6