Splitting VCF file to decrease file size to run it on VEP and wANNOVAR
5.5 years ago
S AR ▴ 80

I have a VCF file from a GDM patient. It contains SNPs and indels from a single sample, and I want to split it into pieces small enough for the upload limits of these online tools without breaking the VCF format. Any suggestions?

Hello S AR,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

5.5 years ago

Shorter:

bgzip variants.vcf 
tabix variants.vcf.gz 
tabix -l variants.vcf.gz | parallel -j 5 'tabix -h variants.vcf.gz {} > {}.vcf'

# annotate, creating annot_chr*.vcf
bcftools concat annot_chr*.vcf > annot_variants.vcf

From the tabix manual:

-l, --list-chroms List the sequence names stored in the index file.
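
Since the question asks to avoid breaking the VCF, a quick sanity check after splitting (and again after `bcftools concat`) is to compare record counts. The following is a minimal sketch, not part of the original answer; the function name `count_records` is hypothetical, and it assumes `gzip -cd` can read the bgzip-compressed file (bgzip output is gzip-compatible).

```shell
# Count data (non-header) lines across one or more VCF files.
# Files ending in .gz are decompressed on the fly; others are read as plain text.
# count_records is a hypothetical helper name, not a tool from the thread.
count_records() {
    for f in "$@"; do
        case $f in
            *.gz) gzip -cd "$f" ;;
            *)    cat "$f" ;;
        esac
    done | grep -vc '^#'
}
```

If `count_records variants.vcf.gz` and `count_records` run over all the per-chromosome files print the same number, no records were lost in the split.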

5.5 years ago

I split by chromosome for things like that, using bgzip, tabix, Unix commands, bcftools and GNU parallel (optional):

bgzip variants.vcf 
tabix -p vcf variants.vcf.gz 
zgrep -v '^#' variants.vcf.gz  | cut -f1 | sort -u > chromosomes.txt
cat chromosomes.txt | parallel -j 5 --bar 'tabix variants.vcf.gz {} > {}.prevcf'
zgrep '^#' variants.vcf.gz > header
ls *.prevcf | parallel -j 5 'cat header {} > {.}.vcf'
rm *.prevcf
# annotate, creating annot_chr*.vcf
bcftools concat annot_chr*.vcf > annot_variants.vcf
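
The same idea (copy the header into every per-chromosome piece) can also be sketched in a single awk pass when bgzip and tabix are not available. This is only an illustration under assumptions: the input is an uncompressed VCF, the function name `split_vcf_by_chrom` and its arguments are made up for this example, and awk keeps one output file open per chromosome, which may hit open-file limits on files with very many contigs.

```shell
# Split an uncompressed VCF into <out_dir>/<CHROM>.vcf, one file per chromosome,
# writing the complete header at the top of every piece.
# split_vcf_by_chrom is a hypothetical helper name.
split_vcf_by_chrom() {
    in_vcf=$1
    out_dir=$2
    mkdir -p "$out_dir"
    awk -F'\t' -v dir="$out_dir" '
        /^#/ { header = header $0 "\n"; next }   # collect every header line
        {
            out = dir "/" $1 ".vcf"
            if (!($1 in seen)) {                 # first record of this chromosome
                printf "%s", header > out        # start the piece with the header
                seen[$1] = 1
            }
            print > out                          # append the record unchanged
        }
    ' "$in_vcf"
}
```

For example, `split_vcf_by_chrom variants.vcf split` would write `split/chr1.vcf`, `split/chr2.vcf` and so on, using whatever chromosome names appear in the first column.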

WouterDeCoster, someone posted a cool trick for getting the chromosomes: after indexing, running tabix -l variants.vcf.gz lists the chromosomes in the VCF.

Edit: It is Fin :).


Yes, I'd definitely recommend finswimmer's answer in this thread.


Wow, that's great. I will try this and then I'll update here. Thank you so much.

5.5 years ago

Try vcftools:

for i in chr{1..22}; do echo vcftools --chr $i --vcf input.vcf --recode --recode-INFO-all --out $i; done

Remove echo when you are ready to execute. vcftools treats --out as a prefix, so each result is written to chrN.recode.vcf.

If you are okay with GNU parallel and vcftools, you can try this:

$ parallel --dry-run vcftools --chr {} --vcf input.vcf --recode --recode-INFO-all --out {} ::: chr{1..22}

Remove --dry-run when you are ready to execute.
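
A hard-coded chr{1..22} skips chrX, chrY and chrM, and matches nothing if the file names its chromosomes 1, 2, ... without the chr prefix. As a hedged sketch, the loop can instead be driven by the names actually present in the file. The helper names `list_chroms` and `print_vcftools_cmds` are hypothetical, and the input is assumed to be an uncompressed VCF; the second function only prints the commands, like the echo version above.

```shell
# Print each chromosome name found in an uncompressed VCF once,
# in order of first appearance. list_chroms is a hypothetical helper name.
list_chroms() {
    awk -F'\t' '!/^#/ && !($1 in seen) { seen[$1] = 1; print $1 }' "$1"
}

# Dry run: print one vcftools command per chromosome present in the file.
# Nothing is executed; pipe the output to sh once the commands look right.
print_vcftools_cmds() {
    list_chroms "$1" | while read -r chrom; do
        echo "vcftools --chr $chrom --vcf $1 --recode --recode-INFO-all --out $chrom"
    done
}
```

Running `print_vcftools_cmds input.vcf` prints the commands for review before anything is written to disk.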
