Biostar Beta. Not for public use.
How to get a nucleotide at a specified position using a GenBank chromosome accession ID with biomaRt?
0
Entering edit mode
11 months ago
foxdie • 10

What I have: I have analyzed a set of WES data with VarScan 2. I now have a somatic variant call .csv with "chrom", "position", "ref", "var", etc. as columns headers. E.g.:

chrom       position  ref   var   normal_reads1
CM000994.2  3681266     T     G             229
CM000994.2  6558171     A     C              11
.
.
.

What I need: I am trying to (1) get the reference nucleotide at a position as well as the flanking nucleotides (i.e. n–1 and n+1). I then need to concatenate them to make a trinucleotide string for each variant (e.g. "ATG", where ref="T", n–1="A", n+1="G").

My problem: The VarScan 2 output has, as far as I can tell, GenBank accession ID's as values under the "chrom" column (e.g. CM000994.2) I am working with mouse build mm10 and have set that as my biomaRt dataset. I am trying to use biomaRt to get the reference nucleotide, the n–1 nucleotide, and the n+1 nucleotide. However, I can't figure out how to use GenBank accession ID as a query input for biomaRt. Here is an example of what I am getting:

> getSequence(chromosome = CM000994.2, start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = chromosome_name)

Error in getBM(c(seqType, type), filters = c("chromosome_name", "start",  : 
  object 'CM000994.2' not found

How do I either 1) access nucleotide positions using a GenBank chromosome accesion ID and a position, or 2) convert GenBank chromosome accession ID to a biomaRt-usable chromosome label? ANY help would be appreciated.

EDIT: Should I just use BIostrings+BSgenome+GenomicRanges for this?

ADD COMMENTlink
1
Entering edit mode
10 months ago
pbpanigrahi • 180

The error says: object CM000994.2 not found. Try giving in double quote. It may solve. Let us know if it don't.

getSequence(chromosome = "CM000994.2", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = chromosome_name)

"CM000994.2"

ADD COMMENTlink
0
Entering edit mode

I think OP has to remove .2 extension. However, OP didn't post what following objects are chromosome_name, mouse_dataset are. Some thing like this: data[,1]=gsub("\\.[0-9]$","",data[,1])

ADD REPLYlink
0
Entering edit mode

@cpad, based on the R error, "object 'CM000994.2' not found", it seems obvious to me that he has missed double or single quote since CM000994.2 is character string. If we try XYZ in R console..we will get error that object 'XYZ' not found.

I agree with you that following objects are chromosome_name, mouse_dataset are missing, based on which we can precisely know the error.

ADD REPLYlink
0
Entering edit mode

@pb: you are correct that OP has to quote chormosome name.

ADD REPLYlink
0
Entering edit mode

@pbpanigrahi, what do you mean:

following objects are chromosome_name, mouse_dataset are missing

? For reference, I have mouse_dataset = useDataset("mmusculus_gene_ensembl", mart = ensembl) if that is what you wanted to know. Let me know if that helps.

ADD REPLYlink
0
Entering edit mode

(1of3) Hi, thanks for the reply.

When I enter:

> getSequence(chromosome = "CM000994.2", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = "cdna", type = chromosome_name)

I get a similar object not found error again for type = chromosome_name:

Error in attributes %in% listAttributes(mart, what = "name") : 
object 'chromosome_name' not found

Even then, when I put "chromosome_name" in quotes, the query goes through, but it doesn't return any data:

> getSequence(chromosome = "CM000994", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = "cdna", type = "chromosome_name")
[1] cdna            chromosome_name
<0 rows> (or 0-length row.names)

I'm not sure what I'm doing wrong. It doesn't seem to actually be accessing the genome coordinates. Is it possible that it's simply not recognizing the GenBank ID, or would it give me a definitive error?

From Google/NCBI, I know that GenBank ID CM000994.2 is mouse chromosome 1, so I decided to do some sleuthing and went into the mouse_dataset@filters to see what chromosome labels ("options") are actually accepted for chromosome:

> colnames(mouse_dataset@filters)
[1] "name"            "description"     "options"        
[4] "fullDescription" "filters"         "type"           
[7] "operation"       "filters8"        "filters9" 

> mouse_dataset@filters$name[1:5]
[1] "chromosome_name" "start"           "end"            
[4] "band_start"      "band_end"
ADD REPLYlink
0
Entering edit mode
 > mouse_dataset@filters[1,]
         name              description
1 chromosome_name Chromosome/scaffold name
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        options
1 [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,CHR_CAST_EI_MMCHR11_CTG4,CHR_CAST_EI_MMCHR11_CTG5,CHR_MG51_PATCH,CHR_MG65_PATCH,CHR_MG74_PATCH,CHR_MG89_PATCH,CHR_MG104_PATCH,CHR_MG117_PATCH,CHR_MG132_PATCH,CHR_MG153_PATCH,CHR_MG171_PATCH,CHR_MG184_PATCH,CHR_MG190_MG3751_PATCH,CHR_MG191_PATCH,CHR_MG209_PATCH,CHR_MG3172_PATCH,CHR_MG3231_PATCH,CHR_MG3251_PATCH,CHR_MG3490_PATCH,CHR_MG3496_PATCH,CHR_MG3530_PATCH,CHR_MG3561_PATCH,CHR_MG3562_PATCH,CHR_MG3609_PATCH,CHR_MG3618_PATCH,CHR_MG3627_PATCH,CHR_MG3648_PATCH,CHR_MG3656_PATCH,CHR_MG3683_PATCH,CHR_MG3686_PATCH,CHR_MG3700_PATCH,CHR_MG3712_PATCH,CHR_MG3714_PATCH,CHR_MG3829_PATCH,CHR_MG3833_MG4220_PATCH,CHR_MG3835_PATCH,CHR_MG3836_PATCH,CHR_MG3999_PATCH,CHR_MG4136_PATCH,CHR_MG4138_PATCH,CHR_MG4151_PATCH,CHR_MG4162_PATCH,CHR_MG4180_PATCH,CHR_MG4198_PATCH,CHR_MG4200_PATCH,CHR_MG4209_PATCH,CHR_MG4211_PATCH,CHR_MG4212_PATCH,CHR_MG4213_PATCH,CHR_MG4214_PATCH,CHR_MG4222_MG3908_PATCH,CHR_MG4243_PATCH,CHR_MG4248_PATCH,CHR_MG4249_PATCH,CHR_MG4254_PATCH,CHR_MG4255_PATCH,CHR_MG4259_PATCH,CHR_MG4261_PATCH,CHR_MG4264_PATCH,CHR_MG4265_PATCH,CHR_MG4266_PATCH,CHR_MG4281_PATCH,CHR_MG4288_PATCH,CHR_MG4308_PATCH,CHR_MG4310_MG4311_PATCH,CHR_MMCHR1_CHORI29_IDD5_1,CHR_PWK_PHJ_MMCHR11_CTG1,CHR_PWK_PHJ_MMCHR11_CTG2,CHR_PWK_PHJ_MMCHR11_CTG3,CHR_WSB_EIJ_MMCHR11_CTG1,CHR_WSB_EIJ_MMCHR11_CTG2,CHR_WSB_EIJ_MMCHR11_CTG3,GL456210.1,GL456211.1,GL456212.1,GL456216.1,GL456219.1,GL456221.1,GL456233.1,GL456239.1,GL456350.1,GL456354.1,GL456372.1,GL456381.1,GL456385.1,JH584292.1,JH584293.1,JH584294.1,JH584295.1,JH584296.1,JH584297.1,JH584298.1,JH584299.1,JH584303.1,JH584304.1,MT,X,Y]
  fullDescription filters type operation
1                 filters text         =
                        filters8  filters9
1 mmusculus_gene_ensembl__gene__main name_1059
ADD REPLYlink
0
Entering edit mode

(3of3)So, from this I tried to run the following:

> getSequence(chromosome = "1", start = 3681267, end = 3681267, upstream = 2, mart = mouse_dataset, seqType = 'cdna', type = 'chromosome_name')
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              cdna
1 CCTCCTTCAGTCTTCCCCTCACATGGCCCATCTCCCCACACTTTACTTCACCTCTGAGAAGAGGGAGGCCCCTCACAGGTAGCCAACACACTCTATCACTTCAGGTCATTGCAGGACTAGGTACACTCTCCCTCTGAGGCTAGGCAAAACAGCTCAGTTTCAAGAACAGAATTAAAGTGAGAATTCTTAAATAAAAGACATAAGAATACTTTAAACCCTGCTCCAAACAGACACGTTAAGCAATCATGAACACTATTTTTGTTTTGTTTTGTTCTGTTTTACAATGCTTAGAAAATAAAGGAATGTATATGTTGTAGTTCTTATTGTGGTTGTCTAATTAAGACAACCCAATATCAACTAAAATAAATGAATAAATAAAAGAGCCGTAACACTAATATCCTCAGAGAATCTCCCAAACATGCATAACTCAGTCATTTTCATTGACTGGCTGGTGAACAACACACAAAAAAGGAGAATATCTGAGCCTTGCTCTCTTCCAGAGGCACATAGCCCAAGGAAGAAATACAGACTGACTGGGGTAGAAGTATTAAATATCCACACATGGCCTTACAACTGCTCAGTGCTGACAATATTTTTTTTAATTATTTTAAAAACTAACACAATGTTTCTTTGTTCACTGAATGATTTGTATATATATCCCCTTTATTCCTCTTCTATAGATTTTAGTATCTTCTTTCTTATCTATTTTATTCTCCTAATAACAAACTTCTATCTAGCATCTATCTATATATCTAACTAACTTTCATTTTATTTTGATTCTCAGTATGTTTCTCTTGCCTTCATTGGGGGCTCTGCTCCTTCTGGGTCTATCCCAAGAGAAAGAGTGGGGGAGGTTTCTAGAGTTTAAAACAAAGAGGTAGCACATTTTTTACTTTAAATATTCTAACTTTCTTTTTATTCATTTTCTTCTCTCCTATTTTTCTATCTATCTATTATGTTTGATATGTTTTGGTTATTTCTGCTATGATGAATCCTATTATTTCTATGATGGTTCAAGATGTGATTATTAGCACTTAGTTTTTATCTTATTCTTGTATGTGTTTTACTGCTACTTGTTGCTGATATTTTTATTTCTCCTTAATACATGGAGCTTTAATGGCCATGCCTCCTGTGTAGACGTTCTGCCACTAAGTGAAAGCCTCAGTCCAAAGAATAGGGGAGATAAAAACATCTCTATGTCAATGACTACTTTCCAAATAGATAAAGGAAGAGAATAGGAGTTCTGAAGACCCATTCGTGGACTTTCATGACTACAGCAAGTATAAAAAGAAGCAGCAAAAAGTACAAGGGCACACATTTTTGAGAAACTTCAGACAAAGCCTTTCAAAGCATGCATTACCAATAAAAACAAGAGGAATAATTACAGAGGGTACATCCTACACATTGCTCCATCAACATGATATCTTCCTGACTGTGCCCTTCCCCCAGCTTGTATAGGCTAAACAAACCCTGGTGGTTTTACAGCCTGGCTTCACCACATTTGCATTTTCTCGACCAGAGCTTTGAGAGGCTGACTACCTGCTGAGCTTGCTTTGCAACACAATCCAGCTGAAACAAAGAATAGATCTGTGTTTTTTCCTAATCAGAGCTGAAAATTAAACTTTGATCCTTGATC
chromosome_name
1               1

So, it returned something, but it returned the whole section of cDNA, which is not what I want. I simply want the nucleotide, n–1, and n+1 in the reference genome at the given positions! I'm pretty lost.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1