Plink Error: Duplicate ID 'chr:pos:A1:A2' generated by --set-missing-var-ids
5.2 years ago
jamespower ▴ 100

Hi,

I am trying to work with the 1000 Genomes data, starting from the VCF file, in which many variant IDs are missing.

So, I thought of adding the flag --set-missing-var-ids @:#:\$1:\$2,

but this gives me the error:

Error: Duplicate ID generated by --set-missing-var-ids.

So I guess I first need to find the duplicates, identified by chr, pos, A1, and A2 rather than by ID.

Is there a way to do this? I know there is --list-duplicate-vars, but its output is not much use here because many of my variants have no ID.

Any help would be greatly appreciated!

plink duplicates missing id • 6.6k views

You're much better off using plink 2.0 for this operation, since (i) it lets you specify REF/ALT alleles when constructing the IDs, avoiding the duplicate indel problem, and (ii) it forces you to specify how very long allele codes should be handled instead of silently truncating them, preventing some nasty surprises later on.

It's also reasonable to use awk/cut for this, but if you do, you'll need to be careful with very long indels.
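If you do go the awk route, one way to be careful with long indels is to leave the ID missing whenever an allele exceeds a chosen length, rather than building a huge ID. This is only a rough sketch, assuming an uncompressed, tab-delimited VCF; the file names `demo2.vcf` and `demo2.ids.vcf`, the toy records, and the `max_len` cutoff are all made up for illustration.

```shell
# Hypothetical toy VCF; the second record has a long insertion allele.
printf '#CHROM\tPOS\tID\tREF\tALT\n' > demo2.vcf
printf '1\t100\t.\tA\tG\n' >> demo2.vcf
printf '1\t200\t.\tC\tCTTTTTTTTTTTTTTTTTTTTTTTTTTTT\n' >> demo2.vcf
printf '1\t300\trs123\tG\tT\n' >> demo2.vcf

max_len=23  # assumed cutoff, chosen only for illustration

# Fill in missing IDs as chr:pos:REF:ALT, but leave the ID as "."
# when either allele is longer than max_len. Existing IDs are kept.
awk -F'\t' -v OFS='\t' -v max="$max_len" '
  /^#/ { print; next }
  $3 == "." && length($4) <= max && length($5) <= max {
    $3 = $1 ":" $2 ":" $4 ":" $5
  }
  { print }
' demo2.vcf > demo2.ids.vcf

# Collect the resulting ID column as a comma-separated list.
ids=$(awk -F'\t' '!/^#/ {print $3}' demo2.ids.vcf | paste -sd, -)
echo "$ids"
```

Records whose ID is still "." afterwards are the long-allele ones, so they can be inspected or named separately.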


Thanks! Using plink2 with --set-missing-var-ids @:#\$r:\$a and --new-id-max-allele-len 23 missing works great!


Have you considered a simple awk or cut followed by sort | uniq -c? It might also be worth looking into various options that bcftools offers.
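Something along these lines, as a minimal sketch: it assumes an uncompressed, tab-delimited VCF, and the file name `demo.vcf` and its records are invented for illustration. It builds a chr:pos:REF:ALT key for each record and prints the keys that occur more than once.

```shell
# Hypothetical toy VCF for illustration; replace with your own file.
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf '1\t100\t.\tA\tG\t.\tPASS\t.\n' >> demo.vcf
printf '1\t100\t.\tA\tG\t.\tPASS\t.\n' >> demo.vcf
printf '1\t200\t.\tC\tT\t.\tPASS\t.\n' >> demo.vcf

# Skip header lines, build chr:pos:REF:ALT keys, count them,
# and keep only the keys seen more than once.
dups=$(awk -F'\t' '!/^#/ {print $1":"$2":"$4":"$5}' demo.vcf \
  | sort | uniq -c | awk '$1 > 1 {print $2}')
echo "$dups"
```

For a compressed VCF, decompressing to stdout first (for example with `zcat`) and piping into the first awk should work the same way.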


Indeed, I deal with this issue outside plink, using BCFtools. Take a look at Step 4 of "Produce PCA bi-plot for 1000 Genomes Phase III - Version 2".
