Biostar Beta. Not for public use.
Question: Keep one SNP for duplicate SNPs

I have converted a VCF file to PLINK bed files, and there are some duplicate SNPs. I would like to remove the duplicates but keep one record for each. For example, if rs1234 appears 5 times, I want to keep only one record (maybe the first one).

Right now I use --write-snplist to get the SNP list of the bed file, then use R to count how often each SNP ID occurs and to generate a list of the duplicated IDs. With that list, I use --extract to get a bed file containing only the duplicated SNPs, and --exclude to get a bed file without any of them.

But how can I keep one record for each duplicated SNP? Also, is there a way to do the above steps in plink, without switching to R to generate the duplicate SNP list?
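For reference, the counting step currently done in R is only a few lines in any scripting language. The following is a minimal Python sketch, assuming the `--write-snplist` output has one variant ID per line; the function names and the file names in the usage note are illustrative, not part of plink.

```python
from collections import Counter


def read_ids(path):
    """Read one variant ID per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def duplicated_ids(ids):
    """Return every ID that occurs more than once, in order of first appearance."""
    counts = Counter(ids)  # keeps first-seen order of the keys
    return [snp_id for snp_id, n in counts.items() if n > 1]


def write_ids(ids, path):
    """Write one ID per line, the layout --extract and --exclude read."""
    with open(path, "w") as handle:
        for snp_id in ids:
            handle.write(snp_id + "\n")


# Small in-memory demonstration with made-up IDs.
example = ["rs1", "rs2", "rs1", "rs3", "rs2", "rs1"]
print(duplicated_ids(example))  # ['rs1', 'rs2']
```

With real data this would be something like `write_ids(duplicated_ids(read_ids("plink.snplist")), "dup.snplist")` (hypothetical file names), and the resulting list can be passed to --extract or --exclude as described above. Note that this removes every copy of a duplicated ID; it does not by itself keep one record per ID.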

2.5 years ago by janhuang.cn • 150 • updated 2.1 years ago by Biostar • 20

What do you mean by duplicate SNPs? Are they reported multiple times on the same chromosome/contig, at the same coordinates, and with the same REF and ALT alleles as well?

This post may be helpful:

How to filter out duplicate records in a vcf with bcftools?

2.5 years ago by prasundutta87 • 330

Thank you.

I mean that the same SNP ID is reported multiple times in a VCF file (1000G), on the same chromosome.

One example is rs35404796 at chr22:18496882, which is reported three times; the REF allele is always G, but the ALT alleles differ ("GAC", "GACACAC", "GACACACAC").

Another case is rs7410429, which is reported twice at different positions (chr22:18003597 and chr22:18004254), with the same REF and ALT.

2.5 years ago by janhuang.cn • 150

Your first example is not of SNPs; they are insertions, and they are different.

I'm not sure why you would want to do what you want to do, but I would write a program to iterate through the VCF file line by line, maintain a hashset of RSIDs, and only retain lines whose RSID has not been seen previously.
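The hashset approach described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a tested tool: the function names are made up, records whose ID is "." are all kept (an assumption, since a missing ID does not identify a variant), and ID fields that hold several semicolon-separated IDs are treated as one string. The QUAL, FILTER and INFO values in the demo records are "." placeholders.

```python
import gzip


def dedupe_by_id(lines):
    """Yield VCF lines, keeping only the first record seen for each ID (column 3)."""
    seen = set()
    for line in lines:
        if line.startswith("#"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.split("\t", 3)
        # Data lines have at least 8 columns; treat anything shorter as ID-less.
        rsid = fields[2] if len(fields) > 3 else "."
        if rsid == ".":  # assumption: keep every record without an ID
            yield line
            continue
        if rsid in seen:  # a later copy of an ID already written
            continue
        seen.add(rsid)
        yield line


def open_text(path, mode="rt"):
    """Open plain or gzip-compressed text, chosen by file extension."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def dedupe_file(in_path, out_path):
    """Stream in_path to out_path, dropping repeated IDs in a single pass."""
    with open_text(in_path) as src, open_text(out_path, "wt") as dst:
        dst.writelines(dedupe_by_id(src))


# In-memory demonstration; only the first rs10656307 record is printed.
demo = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t28698027\trs10656307\tA\tAAAT\t.\t.\t.",
    "22\t28698027\trs10656307\tA\tAAATAAT\t.\t.\t.",
]
for line in dedupe_by_id(demo):
    print(line)
```

The file is read once and only the set of distinct IDs is held in memory, so the cost grows with the number of variants rather than with the number of samples per line.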

2.5 years ago by Brian Bushnell • 16k

I was calculating LD using --r2, but it returns Error: Duplicate ID 'rs10656307'. This one also appears to be an insertion: the two rs10656307 records have the same chr:pos (chr22:28698027) and the same REF (A), but different ALT alleles (AAAT and AAATAAT). That is why I want to exclude duplicate records.

2.5 years ago by janhuang.cn • 150

Oh, interesting; that's unfortunate. Well, I still recommend writing a quick program to remove the duplicate RSIDs, as I mentioned above. But if there are only a handful you could easily remove all copies of them via grep instead.
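One caveat with a plain grep is that it matches substrings, so a pattern such as rs123 also hits rs1234. A small Python sketch of the same idea that compares the ID column exactly is shown below; `drop_ids` is an illustrative name, the trailing "." columns are placeholders, and the rs106563070 record is hypothetical, included only to show that a longer ID sharing the same prefix is left alone.

```python
def drop_ids(lines, ids_to_drop):
    """Yield VCF lines whose ID column (column 3) is not in ids_to_drop."""
    ids_to_drop = set(ids_to_drop)
    for line in lines:
        if not line.startswith("#"):
            fields = line.split("\t", 3)
            # Exact comparison on the ID column, unlike a substring match.
            if len(fields) > 2 and fields[2] in ids_to_drop:
                continue
        yield line


demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t28698027\trs10656307\tA\tAAAT\t.\t.\t.",
    "22\t28698027\trs10656307\tA\tAAATAAT\t.\t.\t.",
    "22\t100\trs106563070\tC\tT\t.\t.\t.",  # hypothetical record
]
for line in drop_ids(demo, ["rs10656307"]):
    print(line)
```

Both rs10656307 records are dropped, while the header and the hypothetical rs106563070 record are kept.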

2.5 years ago by Brian Bushnell • 16k

It does not seem to be just a handful, and it is a large dataset. Iterating through the VCF line by line sounds very slow, but I will see if I can do that. Thanks.

2.5 years ago by janhuang.cn • 150

Any tool which accomplished the task would have to iterate through the VCF line by line, though :)

2.5 years ago by Brian Bushnell • 16k

Have you solved the duplicate problem?

You gave an example in which rs7410429 is reported twice in the 1000 Genomes VCF file at different positions: chr22:18003597 and chr22:18004254.

I ran into the same situation. rs13406140 (on chromosome 2) occurs twice in the 1000 Genomes VCF, and the two coordinates for the same RSID are surprisingly far apart:

2 90430223 rs13406140 G A 100 PASS
2 91651998 rs13406140 A G 100 PASS
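To see how many IDs are affected in this way, the records can be grouped by ID and the distinct positions listed. A minimal Python sketch follows; the function name is illustrative, records are assumed to be whitespace- or tab-separated with CHROM, POS and ID as the first three columns, and records with ID "." are skipped.

```python
from collections import defaultdict


def ids_at_multiple_positions(lines):
    """Map each ID to its sorted (chrom, pos) pairs, for IDs seen at more than one position."""
    positions = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, rsid = line.split()[:3]
        if rsid == ".":  # records without an ID cannot be grouped
            continue
        positions[rsid].add((chrom, int(pos)))
    return {rsid: sorted(p) for rsid, p in positions.items() if len(p) > 1}


# The two records quoted above.
records = [
    "2 90430223 rs13406140 G A 100 PASS",
    "2 91651998 rs13406140 A G 100 PASS",
]
print(ids_at_multiple_positions(records))
# {'rs13406140': [('2', 90430223), ('2', 91651998)]}
```

IDs repeated at a single position with different ALT alleles, like the insertion examples earlier in the thread, are not reported here, which separates the two kinds of duplicate.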

I looked this up in the official 1000 Genomes Q&A and found a likely answer: "Why are there duplicate calls in the phase 3 call set" http://www.internationalgenome.org/category/variants/

I'm still unsure about this: how can a single RSID map to two different positions? Can anyone help?

19 months ago by keryruo • 10
