Hello,
I am trying to filter a VCF file based on the R2 value that each SNP has. I would like to only keep the entries which have a R2 >= 0.3.
Are these the right commands/procedure?
plink --vcf dataset.dose.vcf ----recode --out dataset.dose #(vcf to map/ped)
plink --file dataset.dose --indep-pairwise 1345835 5 0.3 --out dataset.dose #(to get a list of SNPs R2 >= 0.3)
plink --file dataset.dose --extract plink.prune.in --make-bed --out pruneddata #(to perform the extraction of the SNPs and convert map/ped files to bed/bim/fam)
Diego
Kevin Blighe any suggestions for running --indep-pairwise with unmapped SNPs? These are genotype by sequencing data and there is not a reference genome available for this species. Wondering what the best approach would be in this scenario in terms of parameter selection.
Can't really do that because, by its very nature, it needs positional information - it is assumed that variants are inherited throughout generations through 'haploblocks' [chunks of DNA], that give rise to the phenomena of linkage.
In which format is your input data?
That makes sense. The input data are in VCF format. They are RADseq data that have already been thinned to a single SNP from each marker. The concern is that some of these RADseq loci may be co-located on haploblocks but there is no real way to know other than maybe aligning to a closely related genome.
Sounds like it could become a bit of a 'rabbit hole' analysis. Not sure what data you have, exactly, but perhaps a synteny analysis may help if you just have sequences of your species of interest and no positional information. By synteny, you could infer positional information of your sequences in other defined species:
For example, take a look: https://rest.ensembl.org/documentation/info/genomic_alignment_region [main Ensembl REST page: https://rest.ensembl.org/]
I did this previously for mapping between Gallus gallus and Homo sapiens, with moderate success.
Should you not do genome assembly and then go from there to obtain positional info? - again, not sure which data it is that you have.
There is a closely related genome to our taxa of interest so I can give that a go.
Basically, we have around 10k SNPs derived from RADseq data (thinned to a single SNP per marker) and we want to LD thin these loci before running in STRUCTURE, where correlated markers can bias results. However it seems that a lot of population genetic studies using RADseq don't take this step (some of my previous work included). I was hoping there was a simple r^2 test to thin these unmapped SNPs but it seems like this is an inappropriate approach to the solution.
Thanks for the insight Kevin Blighe , much appreciated.
I see, molecular ecology? Questions on this topic are not common here, as you can infer. The synteny approach may be novel in your field, which does equally make that somewhat interesting. Other than that, there should not, technically, be anything stopping you from generating r^2 for each pairwise combination of markers from the VCF. Quite a few programs calculate r^2 for LD pruning, as you probably know, and I imagine that you'll find one that doesn't technically require positional info, or at least one with which you can instruct it to just calculate LD across all of your SNPs by setting a large window size. Im not sure how CHROM and BP are encoded in the VCF from RADseq.