Hello
I want to find the SNPs that could be responsible for the phenotypic differences observed among three populations. To do so, I computed Fst (Weir and Cockerham) using vcftools.
One population (line0) represents the founder population from which the other two populations (line1 and line2) were selected, each for a different trait. The phenotypes of the lines are highly divergent.
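For reference, the three pairwise comparisons can be set up roughly as in the sketch below. It only builds the vcftools command lines and does not run them. The file names (variants.vcf.gz, line0.txt, and so on) and the helper name fst_commands are placeholders for illustration, not my actual files; the flags are the standard vcftools Fst options.

```python
from itertools import combinations


def fst_commands(vcf, pop_files, window=None, step=None):
    """Build one vcftools argument list per pair of populations.

    pop_files maps a population name to a text file listing its sample IDs.
    If window is given, windowed Fst is requested instead of per-SNP Fst.
    """
    commands = []
    for (name_a, file_a), (name_b, file_b) in combinations(pop_files.items(), 2):
        cmd = [
            "vcftools", "--gzvcf", vcf,
            "--weir-fst-pop", file_a,
            "--weir-fst-pop", file_b,
            "--out", f"{name_a}_vs_{name_b}",
        ]
        if window is not None:
            cmd += ["--fst-window-size", str(window)]
            if step is not None:
                cmd += ["--fst-window-step", str(step)]
        commands.append(cmd)
    return commands


# Hypothetical input files, one sample list per population.
pops = {"line0": "line0.txt", "line1": "line1.txt", "line2": "line2.txt"}
cmds = fst_commands("variants.vcf.gz", pops, window=500_000, step=250_000)
for cmd in cmds:
    print(" ".join(cmd))
```

Leaving out window and step gives the per-SNP commands. Note that vcftools itself does not filter windows by SNP count, so a minimum-SNP cutoff has to be applied to the output afterwards.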
Computing per-SNP Fst produces the following representative distribution (figure not shown).
Computing windowed Fst (window = 500 kb; step = 250 kb; min. number of SNPs = 20) produces the following representative distribution (figure not shown).
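To make the windowing explicit, here is a minimal sketch of a sliding-window mean over per-SNP Fst values for a single chromosome. It is an illustration under assumptions, not the exact vcftools procedure: it takes the plain (unweighted) mean of the per-SNP estimates, drops NaN values, and starts the first window at position 1. The function name windowed_mean_fst is made up for this example.

```python
import math


def windowed_mean_fst(positions, fst_values, window=500_000, step=250_000, min_snps=20):
    """Return (start, end, n_snps, mean_fst) for each window on one chromosome.

    Windows with fewer than min_snps usable SNPs are skipped.
    """
    # Keep only SNPs with a defined Fst estimate, sorted by position.
    snps = sorted((p, f) for p, f in zip(positions, fst_values) if not math.isnan(f))
    if not snps:
        return []

    results = []
    last_pos = snps[-1][0]
    start = 1
    while start <= last_pos:
        end = start + window - 1
        in_window = [f for p, f in snps if start <= p <= end]
        if len(in_window) >= min_snps:
            results.append((start, end, len(in_window), sum(in_window) / len(in_window)))
        start += step
    return results


# Toy data with small numbers so the windows are easy to follow.
toy_positions = [100, 200, 300, 600]
toy_fst = [0.1, 0.3, float("nan"), 0.5]
toy_windows = windowed_mean_fst(toy_positions, toy_fst, window=400, step=200, min_snps=2)
print(toy_windows)
```

In the toy example only the first window (positions 1 to 400) contains at least two usable SNPs, so it is the only one reported. The linear scan per window is fine for a sketch but would be slow on genome-scale data.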
First, line1 vs line2 yields a different Fst distribution than either selected line (line1 or line2) vs line0.
Second, the windowed (mean) Fst calculation yields smoother distributions.
I would like to seek advice on the following:
(1) How should outliers be defined, given the two types of Fst distribution observed?
(2) Is windowed Fst more suitable for identifying outliers?
(3) How should the size and step of the sliding window be chosen? (What I chose for this example is based on a similar study, but I guess it might require optimization.)
(4) Do I need to do some type of SNP pruning? (These SNPs are derived from a WGS variant discovery analysis following the GATK Best Practices.)
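To make question (1) concrete, one option would be an empirical-percentile cutoff applied separately to each comparison, since the distributions differ. The sketch below is only an illustration of that idea: the function name and the 1% default are assumptions, not a validated threshold.

```python
def empirical_outliers(values, top_fraction=0.01):
    """Return (threshold, outliers) for the top fraction of Fst values.

    The threshold is the smallest value within the top fraction;
    every value greater than or equal to it is reported as an outlier.
    """
    if not values:
        raise ValueError("values must not be empty")
    ranked = sorted(values, reverse=True)
    # Always keep at least one value so the threshold is defined.
    n_top = max(1, round(len(ranked) * top_fraction))
    threshold = ranked[n_top - 1]
    outliers = [v for v in values if v >= threshold]
    return threshold, outliers


# Toy example: 100 evenly spaced values, top 5% flagged.
toy_values = [i / 100 for i in range(100)]
toy_threshold, toy_outliers = empirical_outliers(toy_values, top_fraction=0.05)
print(toy_threshold, len(toy_outliers))
```

The same function could be run on per-SNP or windowed values, which is part of what I am unsure about.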