I have a VCF SNP dataset that contains 55K markers and 623 samples. Our analysis includes replicates of some samples, coded with the exact same name. We also know there are replicates with different names, as they came from separate sources.
Is it possible to fill in all the "missing" data in one replicate of a sample using the other replicate as the reference? I want to fill in only the missing data and not change anything else.
I have been able to write some R code that can do this to files in STRUCTURE format as that format is amenable to import into R.
I am wondering if it is possible to do something similar directly to the VCF file, or to the .bed, .fam, and .bim files generated by PLINK from a VCF.
I have looked into VCFtools and BCFtools, but the merge and collate functions of those programs are for multiple file sets. That is not what I am trying to do. I am trying to alter data within a single file.
Here is my R code, in case it makes it any clearer what exactly I am trying to do.
My data frame is R_Test_DFSg60.
## Replace 0s (missing data) in row 371 with the values from row 573 where row 573 is not 0 (STRUCTURE format)
for (col in 2:ncol(R_Test_DFSg60)) {
  # Check whether the value in row 371 of the current column is 0 (missing)
  if (R_Test_DFSg60[371, col] == 0) {
    # Check whether the value in row 573 of the current column is not 0
    # (if so, the two values necessarily differ, so no further check is needed)
    if (R_Test_DFSg60[573, col] != 0) {
      # Replace the value in row 371 of the current column with the value from row 573
      R_Test_DFSg60[371, col] <- R_Test_DFSg60[573, col]
    }
  }
}
# Then delete row 573, removing the replicate from the dataset
R_Test_DFSg60 <- R_Test_DFSg60[-573, ]
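To make the VCF version of the question concrete, here is a minimal base-R sketch of the same fill-in logic applied to the text lines of a VCF. It is only an illustration under several assumptions: the function name fill_missing_gt and the sample names S1 and S1_rep are hypothetical, the two replicates have distinct column names (identically named columns would first need renaming), the file is uncompressed and small enough to hold in memory, and GT is the first FORMAT subfield. Only genotypes that are entirely missing (".", "./." or ".|.") are filled; everything else is left untouched.

```r
# Hypothetical helper: copy the donor's sample field into the target's
# sample field wherever the target genotype is fully missing.
fill_missing_gt <- function(vcf_lines, target, donor) {
  is_header <- startsWith(vcf_lines, "#")

  # The #CHROM line carries the sample names (columns 10 onward).
  chrom_line <- which(startsWith(vcf_lines, "#CHROM"))
  cols <- strsplit(vcf_lines[chrom_line], "\t", fixed = TRUE)[[1]]
  t_idx <- match(target, cols)
  d_idx <- match(donor, cols)
  stopifnot(!is.na(t_idx), !is.na(d_idx))

  # TRUE for ".", "./." or ".|." (assumption: partially missing calls
  # such as "./1" are NOT treated as missing).
  is_missing <- function(gt) grepl("^\\.([/|]\\.)*$", gt)

  body <- strsplit(vcf_lines[!is_header], "\t", fixed = TRUE)
  filled <- vapply(body, function(f) {
    # Assumption: GT is the first colon-separated subfield.
    t_gt <- sub(":.*", "", f[t_idx])
    d_gt <- sub(":.*", "", f[d_idx])
    if (is_missing(t_gt) && !is_missing(d_gt)) f[t_idx] <- f[d_idx]
    paste(f, collapse = "\t")
  }, character(1))

  vcf_lines[!is_header] <- filled
  vcf_lines
}

# Toy example with made-up records
vcf <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS1_rep",
  "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t./.\t0/1",
  "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/0",
  "1\t300\trs3\tG\tA\t.\tPASS\t.\tGT\t./.\t./."
)
out <- fill_missing_gt(vcf, target = "S1", donor = "S1_rep")
```

On a real file the lines would come from readLines() and be written back with writeLines(). The sketch copies the whole sample field rather than just GT, so any other FORMAT subfields travel with the genotype. Dropping the donor column afterwards is a separate step.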
cross posted: https://stackoverflow.com/questions/78462732/
Thanks for the crosspost! Just trying to get things figured out.