removing duplicate SNPs (same position) with lowest call rate
8.2 years ago

I am trying to solve a problem with my genotyped array data set. For one reason or another, the data set contains SNPs with two or three different names pointing to the same position. For example:

index  SNP    pos        A1  A2  F_MISS
2046   snp_1  113890304  C   T   0
2047   snp_2  113890304  C   T   0.000422
2048   snp_3  113890304  C   T   0

I want to build a list of SNP names to be removed (so I can exclude them in PLINK), keeping only the SNP with the lowest missing rate (F_MISS) at each position.

So from the SNPs above, snp_2 and either snp_1 or snp_3 should be on the removal list.

How would I achieve this?

snp genome • 2.5k views
8.2 years ago
TriS ★ 4.7k

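If you just want to remove duplicates in R (not tested; assumes mySNPmatrix is a data frame with the columns shown in the question):

# Sort by position (column 3), then by F_MISS (column 6), so the best-called SNP comes first
mySNPmatrix <- mySNPmatrix[order(mySNPmatrix[, 3], mySNPmatrix[, 6]), ]
# Key on position only: the SNP names differ, so a name_position key never matches
mySNPmatrix <- mySNPmatrix[!duplicated(mySNPmatrix[, 3]), ]

Note that the F_MISS values differ between duplicates, so the sort matters: duplicated() keeps the first row at each position, which after sorting is the SNP with the lowest missing rate.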

8.2 years ago

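Try this bash one-liner (not tested). It skips the header line, sorts by position and then by F_MISS, and prints the name of every SNP after the first at each position, which gives the removal list. The -g key needs a sort that supports general numeric comparison, such as GNU sort.

tail -n +2 input.txt | sort -k3,3n -k6,6g | awk 'seen[$3]++ {print $2}' > output.txt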
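
As a worked example of the sort-then-deduplicate idea, here is a minimal sketch run on the three rows from the question. The file names snps.txt and remove.txt are placeholders, and it assumes whitespace-separated columns in the order shown (position in column 3, F_MISS in column 6) and a sort that supports -g. Since the example has no chromosome column, it keys on position alone; with multi-chromosome data the key would need to include the chromosome as well.

```shell
# Sample rows from the question, written to a scratch file (hypothetical name).
cat > snps.txt <<'EOF'
index  SNP    pos        A1  A2  F_MISS
2046   snp_1  113890304  C   T   0
2047   snp_2  113890304  C   T   0.000422
2048   snp_3  113890304  C   T   0
EOF

# Skip the header, sort by position (numeric) and then by F_MISS
# (general numeric, so values such as 1e-05 also sort correctly).
# awk then prints the name of every SNP after the first at each position,
# i.e. everything except the SNP with the lowest F_MISS.
tail -n +2 snps.txt \
  | LC_ALL=C sort -k3,3n -k6,6g \
  | awk 'seen[$3]++ { print $2 }' > remove.txt

cat remove.txt
```

On this input the list contains snp_2 plus one of the two tied SNPs, which matches what the question asks for. The resulting file can then be passed to PLINK, for example with plink --bfile mydata --exclude remove.txt --make-bed --out mydata_dedup, where mydata is a placeholder prefix.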
