Biostar Beta. Not for public use.
How to produce a hom/het snp frequency table
0
Entering edit mode
15 months ago
rob234king • 580
UK/Harpenden/Rothamsted Research

I have snps across different isolates that are either heterozygous or homozygous. What I want to do is convert the data to a hom frequency table that (total hom/no. het + hom) say every 10 million bases and produce a heatmap with each isolate a band on the y axis and on the X axis homozygous snp frequency. Any pointers how could do this in R or even python. Seems like would have been done before but my R skills are low. I can work on a tutorial on heatmap2 in R but not sure how to produce the frequency table?

example data:

1 = yes, 0 = no.

lib het hom snp_position

LIB10000 1 0 917206
LIB10000 1 0 917912
LIB10000 1 0 2703436
LIB10000 1 0 2736063
LIB10000 0 1 3843431
LIB10000 1 0 5195338
LIB10000 1 0 8054844
LIB10000 1 0 8108156
LIB10000 0 1 8685923
LIB10000 0 1 8983713
LIB10000 0 1 8984241
LIB10000 1 0 9391014
LIB10000 1 0 9658660
LIB10000 1 0 12116052
LIB10000 1 0 15798269
LIB10000 0 1 19809883
LIB10000 0 1 25505855
LIB10000 0 1 25541608
LIB10000 1 0 26855440
LIB10000 1 0 27136672
LIB10000 0 1 28417750
LIB10000 1 0 29906291
LIB10000 0 1 41573928
LIB10000 0 1 55496549
LIB10000 1 0 55651887
LIB10000 1 0 59141554

output example:

lib bp hom (%)

LIB10000 10000000 28.5714285714286
LIB10000 20000000 33.3333333333333
LIB10000 30000000 50.0

LIB10000 40000000 0.0

SNP • 1.6k views
0
Entering edit mode

what does the bp mean in your output example? is it the snp_position? Can you give example input and example output based on that input?

0
Entering edit mode

The example output is what would want. The bp is 10,000,000 bp along the chromosome so how many snps are within first 10,000,000 bp = 13, and 4 are hom so 4/13*100 = 30.7 so the first one is a little off byt the other two are correct 33% and 50%. The snp position says where it is on the chromosome so can count up how many in the window. hope this helps.

1
Entering edit mode
16 months ago
Ying W ♦ 3.9k
South San Francisco, CA

Theres two ways I can think of to do this in R. The simple way is to write a function in R that will loop through have a counter that keeps track of where the current 10Mb chunk ends and also a running total of het and hom. When pos in row is greater than that current 10Mb chunk calculate an average based on running total of het and hom.

The more advanced way is to bin the positions using something like round_any(pos_vector, 10000000, f = = ceiling) followed by using tapply(hom_vector, pos_vector, mean)

Example of 2nd way to do it:

example = matrix(c(27136672, 28417750, 29906291, 41573928), nrow=4, ncol=1)
example = cbind(example, c(0,1,0,1))
colnames(example) = c("pos", "hom")
example
pos hom
[1,] 27136672 0
[2,] 28417750 1
[3,] 29906291 0
[4,] 41573928 1
example[,1] = round_any(example[,1], 10000000, f = ceiling) #bin by 10M
tapply(example[,2], example[,1], mean)
3e+07 5e+07
0.3333333 1.0000000

Keep in mind this second method will skip bins where there is no data (such as 40M in example above)

0
Entering edit mode

thanks very much, it's something to work with. I have a number of isolates with few snps so may have to remove those below a total snp threshold or add another dimension to account for snp density. Any ideas how to plot this table giving thresholds of percentage hom different colours please let me know. I think this package may do it "ChromHeatMap" but being an R novice, have to see how easy it is to work with using the tutorial.