How to extract alleles and genotypes from a VCF file using Python or another language, for 23 GB files?
Well, I am able to write a script to get the alleles, but for a large VCF file it is difficult. What other ways are there to get allele and genotype information?
Try bcftools query.
How about VCFtools?
Why is this a tool post? A question about tools should be a question-type post, not a tool-type post.
What have you tried?
May help the user (AWK ideas): - https://www.biostars.org/p/291147/#291167 - https://www.biostars.org/p/298361/#298464 Actually, I have a Python script that can parse a VCF, in fact: https://www.biostars.org/p/302940/
Why have you replied to my comment, Kevin?
Did not want to create yet another 4th and independent comment
You can take a look at these two scripts written in Python to split a VCF and select what you want: A: VCF file help and C: parsing vcf file
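For a file this size, the main thing in plain Python is to read one record at a time instead of loading everything. Below is a minimal sketch of that idea, not the linked scripts themselves. The function names open_vcf and iter_genotypes are made up for illustration, and it assumes a tab-separated VCF with a FORMAT column that contains GT.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def open_vcf(path: str):
    """Open a plain or gzip/bgzip-compressed VCF as a text stream."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def iter_genotypes(
    lines: Iterable[str],
) -> Iterator[Tuple[str, int, str, str, List[str]]]:
    """Yield (chrom, pos, ref, alt, [GT per sample]) one record at a time."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information lines and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotypes to report
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue  # this record carries no GT field
        gt_index = keys.index("GT")
        gts = [sample.split(":")[gt_index] for sample in fields[9:]]
        yield fields[0], int(fields[1]), fields[3], fields[4], gts
```

Usage would be something like `with open_vcf("input.vcf.gz") as handle: for record in iter_genotypes(handle): ...` (the file name is a placeholder). Only one line is held in memory at a time, so the file size mostly affects run time rather than RAM.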
Extracting genotype information using R.
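library(vcfR)  # provides read.vcfR() and extract.gt()
vcf <- read.vcfR(vcf_file, verbose = FALSE)
# GT values such as "0/1" are strings, so keep them as characters
gt <- extract.gt(vcf, element = "GT", as.numeric = FALSE)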
For Python, take a look at the following article.
Genotypes can also be extracted using SnpSift.jar in snpEff using the following command.
java -jar ../snpEff/SnpSift.jar extractFields annotated.vcf CHROM POS REF ALT "GEN[*].GT" > output.tsv
Doesn't look like vcfR does streaming read, so I would not recommend it as it's not a great idea to build an in-memory object of an entire VCF file. A better strategy would be to use closer-to-bare-metal tools such as bcftools to extract information, then use R or Python to compute on extracted information.
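A rough sketch of that split in Python: let bcftools do the extraction and read its output as a stream, so nothing large is held in memory. This assumes bcftools is on the PATH. The names QUERY_FORMAT, parse_query_lines and stream_bcftools_query are hypothetical, and the format string uses tabs instead of spaces to make splitting unambiguous.

```python
import subprocess
from typing import Iterable, Iterator, List, Tuple

# Raw string so bcftools itself receives the \t and \n escapes,
# exactly as it would from a single-quoted shell argument.
QUERY_FORMAT = r"%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n"

Record = Tuple[str, int, str, str, List[str]]


def parse_query_lines(lines: Iterable[str]) -> Iterator[Record]:
    """Turn tab-separated 'bcftools query' lines into tuples."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # blank line or a record without sample columns
        chrom, pos, ref, alt = fields[:4]
        yield chrom, int(pos), ref, alt, fields[4:]


def stream_bcftools_query(vcf_path: str) -> Iterator[Record]:
    """Run bcftools query and yield records while it is still running."""
    cmd = ["bcftools", "query", "-f", QUERY_FORMAT, vcf_path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from parse_query_lines(proc.stdout)
    if proc.returncode != 0:
        raise RuntimeError(f"bcftools exited with status {proc.returncode}")
```

Iterating over `stream_bcftools_query("input.vcf")` then gives one tuple per variant, which can be counted, filtered or written out as needed.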
See bcftools query.
EDIT: With bcftools query you can print any information you like. In your case, for example:
$ bcftools query -f '%CHROM %POS %REF %ALT [ %GT]\n' input.vcf
The output now looks like this:
chr1 10177 ACC ACCC 0/1
chr1 10327 T C 0/0
chr1 10352 TAC TAAC 1/1
chr1 12783 G A 1/1
I think this should be a comment, as it's more of a suggestion than a solution. See, for example, cpad's comment pointing to the same resource.
If an "answer" is only meant to be a full copy-and-paste solution, then my post is indeed more of a comment. But I thought that naming the tool with its subcommand and linking to the good manual was answer enough.
I have now extended my post to a full answer :)
cpad was faster than me, right. I didn't see his answer as I hadn't reloaded the page.
Can I ask for the trick to convert the output to symbolic genotypes? For your example:
chr1 10177 ACC ACCC ACC/ACCC
chr1 10327 T C T/T
chr1 10352 TAC TAAC TAAC/TAAC
chr1 12783 G A A/A
I searched for a while, but did not have any luck.
Hello yifangt86,
that's also described in the manual I linked to:
$ bcftools query -f '%CHROM %POS %REF %ALT [ %TGT]\n' input.vcf
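If you already have the numeric GT values and want to do the same translation yourself in Python, the idea is just an index lookup into REF plus the comma-separated ALT alleles. A minimal sketch, assuming well-formed GT strings; translate_gt is a made-up helper name, and this is not a claim about how bcftools implements %TGT internally.

```python
import re


def translate_gt(ref: str, alt: str, gt: str) -> str:
    """Replace allele indices in a GT string with the allele sequences."""
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1..n are the ALTs
    out = []
    # The capturing group keeps the '/' or '|' separators in the result.
    for part in re.split(r"([/|])", gt):
        if part in ("/", "|", "."):
            out.append(part)  # separator or missing allele, keep as-is
        else:
            out.append(alleles[int(part)])
    return "".join(out)


print(translate_gt("ACC", "ACCC", "0/1"))  # ACC/ACCC
```

Phasing is preserved because the separator is copied through unchanged, and missing calls such as ./. stay as they are.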