Extract surrounding genome sequences given certain locations using R
2
1
Entering edit mode
7.3 years ago
DVA ▴ 630

Hello,

I wonder what is the best way/package to extract ~3k bps regions surrounding a certain given location in R. E.g. the location can be Human Chromosome 4 Position 50000. Then I'd like to have the sequence from chr4:47000-53000 of hg19/hg38.

Thank you.

R package genome sequence • 7.5k views
ADD COMMENT
0
Entering edit mode

Do you have to do this with R specifically?

ADD REPLY
0
Entering edit mode

Yes (unfortunately).

ADD REPLY
3
Entering edit mode
7.3 years ago
Chun-Jie Liu ▴ 280

You can use biomaRt

GENES = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "useast.ensembl.org")

YouRange<- getBM(attributes = c("chromosome_name", "start_position", "end_position"), filters = c("chromosome_name","start","end"), values = list(chromosome_name = "4", start = 50000-3000, end = 50000 + 3000), mart = GENES)

You can use getSequence get sequences.

getSequence(chromosome = 4, start = 50000-3000, end = 50000+3000,type='hgnc_symbol',seqType = 'gene_exon_intron', mart = GENES)

getBM helps you get the annotation information from Biomart, getSequence helps you get the sequence you need.

This is a simple example for example question, you can check details of biomaRt to solve your specific question.

ADD COMMENT
0
Entering edit mode

Thank you! I'm going to take a look right now:)

ADD REPLY
0
Entering edit mode

Your code gives me an empty data.frame. I had to specify chromosome as "4" instead of "chr4".

ADD REPLY
0
Entering edit mode

Thanks for your correctness. This is a simple example for using biomaRt. The previous version of Ensembl support 'chr' character. You can choose the Ensembl release or genome build by useMart or useEnsembl.

The Ensembl give the tutorial.

ADD REPLY
0
Entering edit mode

No problem. I never know what type of nomenclature I have to use myself, since sometimes it seems to be "chr4" and sometimes "4". I though Ensembl was using "chr4" whereas UCSF style was "4" but it seems I was mistaken...

ADD REPLY
6
Entering edit mode
7.3 years ago
ddiez ★ 2.0k

An option is to use the BSgenome packages and the method getSeq.

library(BSgenome.Hsapiens.NCBI.GRCh38)
# this gives you the object Hsapiens, which you can inspect (output not included).
Hsapiens

# get the 3000 bases upstream of position chr4:47000.
getSeq(Hsapiens, "4", start = 47000 - 3000, end = 47000 - 1)
3000-letter "DNAString" instance
seq: TTTTCCTTTAAAATTGGGGTAATAGCCCATATCCTCCACATCTCAGAAGGTTCATTTTTATTGGTCCAAAGGAAG...CTGAGTTCTGTGAGTATTTTAATCAAATTCTCAAACTTGAGAAAGGGGTTATGGGAGTCCTCACTTCTTAGTAGC
ADD COMMENT
0
Entering edit mode

Thank you very much.

ADD REPLY
0
Entering edit mode

I wrote a small script to extract sequences of multiple SNPs. getSeq function doesn't seem to support it. Any alternate solutions please?

ADD REPLY
0
Entering edit mode
suppressMessages(library(GenomicFeatures))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg19))

gr=GRanges(seqnames = paste0('chr',seq(1,3)),IRanges(start=c(1,100000,100000),end=c(2,100001,100001)))
tmp=Views(Hsapiens,gr)
sequence=DNAStringSet(as.character(tmp))
ADD REPLY

Login before adding your answer.

Traffic: 2629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6