Hello, I'm analyzing depth-of-coverage output for five breeds, and I want to find where in the genome the depth of coverage differs between them. I have five files with positions in the rows and the samples of each breed in the columns, which I have joined into one table. Now I want to run an ANOVA on each row to identify positions whose depth differs across my samples. How can I run an ANOVA test in R? How should I define the groups in R?
Example of my data:
position A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 C1 C2 C3 C4 C5
500000 11.6 30.3 28.5 11.3 28.6 26.6 27.5 23.9 19.8 8.4 27.6 19.4 20.4 30.2 31.3
501000 10.8 4.7 5.5 6.9 6.3 7.0 4.2 7.5 5.0 5.3 4.5 6.8 7.4 5.3 5.8
502000 5.3 2.5 4.1 3.1 2.4 5.2 3.5 4.2 5.1 6.1 4.1 4.3 6.2 2.8 3.9
503000 5.0 5.2 3.6 5.1 3.3 3.9 4.3 5.1 4.4 4.0 4.9 2.8 3.3 5.1 4.9
504000 13.0 9.4 10.5 19.2 10.2 9.0 11.1 9.0 8.3 20.2 9.2 18.5 10.9 7.3 9.2
505000 6.5 9.8 10.4 46.9 15.0 6.7 13.6 13.3 9.8 43.6 12.3 43.9 11.7 10.6 14.5
506000 9.4 12.5 14.3 13.4 14.0 11.1 14.1 15.3 11.5 14.5 14.5 15.4 15.1 15.3 15.1
508100 1.2 0.0 0.4 0.0 0.1 4.1 0.8 2.1 1.3 1.4 4.5 2.2 4.7 2.6 5.3
Thanks in advance
Since the ANOVA is run on each row, that is, for each genome position between breeds, and there are around 900,000 positions, I can't plot them all. Is it still necessary to check normality, and how can I do that? If the Shapiro test is over-sensitive, what procedure is recommended? Thank you for your time.
I moved your message to a comment (you had posted it as an answer).
Ah, it is in situations like this, with a large number of variables, that the Shapiro test will virtually always say 'not normally distributed', but don't quote me on this; I am not a Professor of Statistics. There is a good discussion here: https://stats.stackexchange.com/questions/12053/what-should-i-check-for-normality-raw-data-or-residuals
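As a rough illustration of the raw-data-versus-residuals point in that discussion, here is a minimal sketch in base R that fits a one-way ANOVA per position and runs the Shapiro test on the model residuals rather than on the raw depths. The names depth, breed and check_row are made up for this example, the two rows are copied from the table in the question, and deriving the breed from the first letter of the column name is an assumption about how the samples are labelled.

```r
# Toy data: two rows copied from the example table in the question.
depth <- rbind(
  "500000" = c(11.6, 30.3, 28.5, 11.3, 28.6, 26.6, 27.5, 23.9, 19.8, 8.4,
               27.6, 19.4, 20.4, 30.2, 31.3),
  "508100" = c(1.2, 0.0, 0.4, 0.0, 0.1, 4.1, 0.8, 2.1, 1.3, 1.4,
               4.5, 2.2, 4.7, 2.6, 5.3)
)
colnames(depth) <- c(paste0("A", 1:5), paste0("B", 1:5), paste0("C", 1:5))

# Group labels: assumes the leading letter of each column name is the breed.
breed <- factor(substr(colnames(depth), 1, 1))

# Hypothetical helper: for one position, fit a one-way ANOVA and test the
# residuals (not the raw depths) for normality.
check_row <- function(y, groups) {
  fit <- aov(y ~ groups)
  c(anova_p   = summary(fit)[[1]][["Pr(>F)"]][1],
    shapiro_p = shapiro.test(residuals(fit))$p.value)
}

# One row of results per position, with columns anova_p and shapiro_p.
res <- t(apply(depth, 1, check_row, groups = breed))
print(res)
```

With roughly 900,000 positions, some rows will fail any normality test by chance alone, so the shapiro_p column is probably more useful as a summary across positions than as a per-row filter.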
Also, given your large number of variables, you will want to 'parallelise' my code (below). You can do this by replacing %do% with %dopar% in the foreach loop. You will also require the doParallel package. Please see here for information on how to choose the number of threads / CPU cores (system dependent): R functions for parallel processing
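To show what the %do% to %dopar% swap looks like in practice, here is a small sketch of a per-row ANOVA loop. It is not the code referred to above, only an illustration under assumptions: the depth and breed objects are toy stand-ins built from two rows of the question's table, and the choice of 2 workers is arbitrary and should be adapted to your machine.

```r
library(foreach)
library(doParallel)  # also attaches 'parallel', which provides makeCluster()

# Toy stand-in for the joined table: two positions, 5 samples per breed.
depth <- rbind(
  "500000" = c(11.6, 30.3, 28.5, 11.3, 28.6, 26.6, 27.5, 23.9, 19.8, 8.4,
               27.6, 19.4, 20.4, 30.2, 31.3),
  "508100" = c(1.2, 0.0, 0.4, 0.0, 0.1, 4.1, 0.8, 2.1, 1.3, 1.4,
               4.5, 2.2, 4.7, 2.6, 5.3)
)
breed <- factor(rep(c("A", "B", "C"), each = 5))

# Register a parallel backend; 2 workers is an arbitrary number for this sketch.
cl <- makeCluster(2)
registerDoParallel(cl)

# Same loop body as a sequential %do% version; only the operator changes.
pvals <- foreach(i = seq_len(nrow(depth)), .combine = c) %dopar% {
  y <- depth[i, ]
  fit <- aov(y ~ breed)
  summary(fit)[[1]][["Pr(>F)"]][1]  # ANOVA p-value for this position
}

stopCluster(cl)

names(pvals) <- rownames(depth)
print(pvals)
```

For very many rows, it may be worth sending chunks of rows to each worker instead of single rows, since the per-task overhead can otherwise outweigh the gain from parallelising.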