Question

What is a faster way than using bcftool's view to extract a population from 1000 Genome VCF files?

0

Entering edit mode

5.6 years ago

Volka ▴ 180

Hi all, I am currently trying to extract the East Asian population from the 1000 Genome Project VCF files. After trying out VCFtools and its --keep option and thinking it was taking too long (about an hour per chromosome), I moved on to using bcftools view with the following code:

for i in {1..22}
do
    bcftools view -S eas.txt -O z -o ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EAS.vcf.gz $thgenome/1KG_phase3_vcf/ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done

Currently, it is processing about 5 chromosome VCF files each hour, but I am wondering if there is a faster program for this? Or is my code written inefficiently?

Thanks!

bcftools 1000 vcf subset extract • 3.1k views

ADD COMMENT • link updated 5.6 years ago by chrchang523 10k • written 5.6 years ago by Volka ▴ 180

zx8754 · Answer 1 · 2018-09-12

1

Entering edit mode

5.6 years ago

chrchang523 10k

Using the files from https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 :

plink2 --pfile ... --keep-if SuperPop == EAS --export vcf

should be a lot faster.

ADD COMMENT • link updated 5.6 years ago by zx8754 11k • written 5.6 years ago by chrchang523 10k

0

Entering edit mode

Hi, thanks for the reply! I was able to download the .pgen, .pvar and predigree corrected .psam files, and extracted them if there were zipped in a .zst format. Then I made sure the files were the same names and ran the following code:

plink2 --pfile all_phase3 --keep-if SuperPop==EAS --export vcf

However, I'm getting an "Error: Malformed .pgen file.". The output is below:

PLINK v2.00a1LM 64-bit Intel (11 Feb 2018)
www.cog-genomics.org/plink/2.0/ (C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to plink2.log. Options in effect: --export vcf --keep-if SuperPop==EAS --pfile all_phase3

Start time: Wed Sep 12 14:39:24 2018 64298 MB RAM detected; reserving 32149 MB for main workspace. Using up to 48 threads (change this with --threads). 2504 samples (1271 females, 1233 males; 2497 founders) loaded from all_phase3.psam. 84805772 variants loaded from all_phase3.pvar. 2 categorical phenotypes loaded. --keep-if: 2000 samples removed. 504 samples (260 females, 244 males; 504 founders) remaining after main filters. --export vcf to plink2.vcf ... 0% Error: Malformed .pgen file. End time: Wed Sep 12 14:40:52 2018

Do you have any suggestions on what's wrong?

ADD REPLY • link 5.6 years ago by Volka ▴ 180

1

Entering edit mode

This requires a more recent plink2 build. Alpha 1 did not support multiallelic variants.

ADD REPLY • link 5.6 years ago by chrchang523 10k

0

Entering edit mode

Oh, that makes sense. I downloaded the latest development build and it works perfectly now, thank you very much! It was very quick too. Is there anything else I would have note about how plink2 handles VCF files, other than what's noted about the reference alleles and phase information in the What's New page?

ADD REPLY • link 5.6 years ago by Volka ▴ 180

0

Entering edit mode

Some situations where you should still use other tools:

There are extra FORMAT fields like GQ and DP, and you still need to retain them (plink2 can filter on them, but it won’t save the original values).
There are {P(AA), P(AB), P(BB)} genotype likelihood triplets, and you need to retain all of these values instead of collapsing them into a single dosage.
There’s unusual ploidy to keep track of (e.g. trisomy 21).

ADD REPLY • link 5.6 years ago by chrchang523 10k