Question

Question about getSequence function of biomaRt in R

0

Entering edit mode

5.4 years ago

Ruby ▴ 10

Hi everyone,

I am working on a project about copy number variant calling algorithms. One question I am asking is that if GC content correlated with the success rate of CNV calling, and to answer that I would need the actual sequence. However, I don't have the original bam or bed files; I only get the output file from the different CNV calling algorithms, and a corrected consensus file among all of these algorithms. (It would be hard for me to explain how did I calculate consensus, so I would leave that part out for now.)

The information I have now is basically just chromosome and position on the chromosome.

"example: ch1:1453:20918"

I am thinking about use these positions info to line back to hg19 reference data, and get the exact sequence, then calculate the GC content by hand.

I called the getSequence function in biomaRt package, the call returned a dataframe with two columns: sequence and gene symbol. When I choose seqType="cdna", I was expecting the sum of length of each fragment/gene equal to the length of the calling segment(end-start), but it did not.

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
result = getSequence(chromosome=1, start=28028201, end=32976095,type="entrezgene",seqType="cdna", mart=ensembl)

#count all the nucleotides
count <- 0
for (i in 1: nrow(result)){
length <- length(unlist(strsplit(result[i,1], split = "")))
count <- count+length
}
end-start
[1] 4947894
count
[1] 16861601

Here is the detailed info about categories in seqType: The type of sequence returned can be specified by the seqType argument which takes the following values:

'cdna': for nucleotide sequences

'peptide': for protein sequences

'3utr': for 3' UTR sequences

'5utr': for 5' UTR sequences

'gene_exon': for exon sequences only

'transcript_exon_intron': gives the full unspliced transcript, that is exons + introns

'gene_exon_intron' gives the exons + introns of a gene;'coding' gives the coding sequence only

'coding_transcript_flank': gives the flanking region of the transcript including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute

'coding_gene_flank': gives the flanking region of the gene including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute

'transcript_flank': gives the flanking region of the transcript exculding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute

'gene_flank': gives the flanking region of the gene excluding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute

Am I understanding the concept or output wrong? Or should I change some parameters to make this work?

Appreciate all your help in advance.

R biomaRt • 3.6k views

ADD COMMENT • link updated 5.4 years ago by Emily 23k • written 5.4 years ago by Ruby ▴ 10

1

Entering edit mode

Please use appropriate tags. Your question is about biomaRt. That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster.

You can edit your question and add the tag now.

ADD REPLY • link 5.4 years ago by Ram 43k

score 5 · Accepted Answer · 2018-12-13

BioMart is gene-centric. It gives you the sequences of genes that fall with the region you specify, not the sequence of the genomic region. "cDNA" is the sequence of the spliced RNA in DNA (ie Ts rather than Us) format.

The easiest way to get what you need is using the Ensembl REST API with the sequence/region endpoint. You can code around that in your favourite language to get the data how you want it, and if you want to input multiple regions then there's a POST version of the same endpoint.