How to calculate dinucleotide A/T patterns in a given human reference genome region?
0
1
Entering edit mode
5.3 years ago
Ashley ▴ 90
library("Biostrings")

I want to calculate the probability of di-nucleotide AA, TT, AT, and TA in each 2 location.

My DNA sequence is as follows:

DNA.set
  A DNAStringSet instance of length 5
    width seq                                                              names               
[1]    20 TCCGTATTGGAAAGCTCGTC                                             SEQ-1
[2]    20 TTAGACCACTCCGCATGTAG                                             SEQ-2
[3]    20 CTGTGGTACGGCTCAAACGG                                             SEQ-3
[4]    20 CTCCCGCCTATCTCCCTTCT                                             SEQ-4
[5]    20 TCGCCTAGAAAAAGTTTCCT                                             SEQ-5

I want to obtain the result as follows:

AA=0,0,0,0,1/5,2/5,0,1/5,0,0
TT=1/5,0,0,0,0,0,0,1/5,1/5,0
AT=0,0,0,0,0,0,0,0,1/5,0,0
TA=0,0,1/5,1/5,1/5,0,0,0,0,0

Any help would be great appreciate.

R sequence • 1.4k views
ADD COMMENT
0
Entering edit mode

dinucleotideFrequency function of biostrings could give those 2mers. than you can take the subset of your desired ones.

ADD REPLY
0
Entering edit mode
> dinucleotideFrequency(DNA.set)
     AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
[1,]  2  0  1  1  0  1  2  1  1  1  1  2  1  3  1  1
[2,]  0  2  2  1  2  2  1  1  1  1  0  1  2  1  1  1
[3,]  2  2  0  0  1  0  2  2  0  1  3  2  1  1  2  0
[4,]  0  0  0  1  0  5  1  5  0  1  0  0  1  4  0  1
[5,]  4  0  2  0  0  2  1  2  1  1  0  1  1  2  0  2

Thanks for your reply. But I want to know the frequency or probability of A/T in each position. Not total number. So dinucleotideFrequency maybe isn't suitable for me.

ADD REPLY
0
Entering edit mode

consensusMatrix(dinucleotideFrequency(DNA.set)) ? maybe ?

ADD REPLY
0
Entering edit mode

For our example,

> consensusMatrix(dinucleotideFrequency(DNA.set))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
0    2    3    2    2    3    1    0    0    2     0     3     1     0     0     2     1
1    0    0    1    3    1    1    3    2    3     5     1     2     4     2     2     3
2    2    2    2    0    1    2    2    2    0     0     0     2     1     1     1     1
3    0    0    0    0    0    0    0    0    0     0     1     0     0     1     0     0
4    1    0    0    0    0    0    0    0    0     0     0     0     0     1     0     0
5    0    0    0    0    0    1    0    1    0     0     0     0     0     0     0     0

Thanks for your kind help. But I think the column of result should be the length(DNA.seq)/2=10, however, the column is 16. And it didn't show which column represents for AA, AT, TA and TT. I am the newcomer of bioinformatics, could you help me figure it out? Thank you so much. With my best wishes.

I think the number of column is always the 16. For another example,

> dna2
  A DNAStringSet instance of length 3
    width seq                                                              names               
[1]     4 ACGT                                                             seq1
[2]     4 GTCA                                                             seq2
[3]     4 GCTA                                                             seq3
> consensusMatrix(dna2, as.prob = TRUE)
       [,1]      [,2]      [,3]      [,4]
A 0.3333333 0.0000000 0.0000000 0.6666667
C 0.0000000 0.6666667 0.3333333 0.0000000
G 0.6666667 0.0000000 0.3333333 0.0000000
T 0.0000000 0.3333333 0.3333333 0.3333333
M 0.0000000 0.0000000 0.0000000 0.0000000
R 0.0000000 0.0000000 0.0000000 0.0000000
W 0.0000000 0.0000000 0.0000000 0.0000000
S 0.0000000 0.0000000 0.0000000 0.0000000
Y 0.0000000 0.0000000 0.0000000 0.0000000
K 0.0000000 0.0000000 0.0000000 0.0000000
V 0.0000000 0.0000000 0.0000000 0.0000000
H 0.0000000 0.0000000 0.0000000 0.0000000
D 0.0000000 0.0000000 0.0000000 0.0000000
B 0.0000000 0.0000000 0.0000000 0.0000000
N 0.0000000 0.0000000 0.0000000 0.0000000
- 0.0000000 0.0000000 0.0000000 0.0000000
+ 0.0000000 0.0000000 0.0000000 0.0000000
. 0.0000000 0.0000000 0.0000000 0.0000000
> dim(consensusMatrix(dna2, as.prob = TRUE))
[1] 18  4
> consensusMatrix(dinucleotideFrequency(dna2))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
0    3    2    3    3    2    3    2    2    3     2     3     1     2     2     3     3
1    0    1    0    0    1    0    1    1    0     1     0     2     1     1     0     0
ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 2694 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6