Extracting alleles and genotypes from a VCF file
2.2 years ago

How can I extract alleles and genotypes from a VCF file using Python or another language, for files as large as 23 GB? I am able to write a script that gets the alleles, but for a large VCF file this is difficult. What other ways are there to get the allele and genotype information?


Try bcftools query.


How about VCFtools?


Why is this a tool post? A question about tools should be a question-type post, not a tool-type post.


What have you tried?


May help the user (AWK ideas):
- https://www.biostars.org/p/291147/#291167
- https://www.biostars.org/p/298361/#298464

Actually, I have a Python script that can parse a VCF, in fact: https://www.biostars.org/p/302940/


Why have you replied to my comment, Kevin?


I did not want to create yet another, fourth, independent comment.


You can take a look at these two scripts, written in Python, to split a VCF and select what you want: "A: VCF file help" and "C: parsing vcf file".

4 weeks ago
arup ♦ 1.3k
India

Extracting genotype information using R.

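library(vcfR)
vcf_file <- "input.vcf"  # hypothetical path; point this at your own VCF
# Note: read.vcfR loads the whole file into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)
# as.numeric = FALSE keeps genotypes as strings such as "0/1"; GT is not a numeric field
gt <- extract.gt(vcf, element = "GT", as.numeric = FALSE)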

For Python, take a look at the following article.

http://alimanfoo.github.io/2017/06/14/read-vcf.html
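If you would rather avoid extra dependencies, a plain-Python reader that walks the file one line at a time keeps memory use flat even for a 23 GB VCF. The following is only a minimal sketch, not the method from the article above: the function names `open_vcf` and `iter_genotypes` are made up for illustration, and it assumes a tab-delimited VCF with a `#CHROM` header line and a FORMAT column that contains `GT`.

```python
import gzip
import io


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_genotypes(lines):
    """Yield (chrom, pos, ref, alt, {sample: GT}) one record at a time."""
    samples = []
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        if len(fields) < 10:
            continue  # blank line or record without genotype columns
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        gts = {
            sample: column.split(":")[gt_idx]
            for sample, column in zip(samples, fields[9:])
        }
        yield fields[0], int(fields[1]), fields[3], fields[4], gts


# Tiny in-memory demo; for a real file use: iter_genotypes(open_vcf("input.vcf.gz"))
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "chr1\t10177\t.\tACC\tACCC\t.\tPASS\t.\tGT:DP\t0/1:12\t1/1:8\n"
)
for record in iter_genotypes(demo):
    print(record)
```

Because `iter_genotypes` is a generator, only one record is held in memory at a time; anything you want to keep (counts, a filtered subset) has to be accumulated by the caller.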

Genotypes can also be extracted using SnpSift.jar in snpEff using the following command.

java -jar ../snpEff/SnpSift.jar extractFields annotated.vcf   CHROM POS REF ALT  "GEN[*].GT" > output.tsv

It doesn't look like vcfR does a streaming read, so I would not recommend it: building an in-memory object of an entire VCF file is not a great idea. A better strategy would be to use closer-to-bare-metal tools such as bcftools to extract the information, then use R or Python to compute on the extracted information.
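That strategy could be sketched in Python roughly as follows: bcftools does the extraction, and Python consumes its output line by line instead of loading everything at once. This is an illustration under assumptions, not a tested pipeline: it assumes `bcftools` is on the PATH, the helper names (`parse_query_line`, `stream_bcftools`, `count_genotypes`) are hypothetical, and the tab-separated format string is chosen here to make splitting easy.

```python
import subprocess
from collections import Counter

# Tab-separated variant of a bcftools query format string (an assumption
# made for easy parsing); the raw string passes \t and \n through to bcftools.
QUERY_FORMAT = r"%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n"


def parse_query_line(line):
    """Split one line of query output into (chrom, pos, ref, alt, [GT, ...])."""
    chrom, pos, ref, alt, *gts = line.rstrip("\n").split("\t")
    return chrom, int(pos), ref, alt, gts


def stream_bcftools(vcf_path):
    """Run `bcftools query` and yield parsed records as they arrive."""
    cmd = ["bcftools", "query", "-f", QUERY_FORMAT, vcf_path]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield parse_query_line(line)
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


def count_genotypes(records):
    """Example computation: tally how often each genotype string occurs."""
    counts = Counter()
    for _chrom, _pos, _ref, _alt, gts in records:
        counts.update(gts)
    return counts


# Demo on hand-written lines so the sketch runs without bcftools installed;
# with a real file you would call count_genotypes(stream_bcftools("input.vcf")).
sample_lines = [
    "chr1\t10177\tACC\tACCC\t0/1\n",
    "chr1\t10327\tT\tC\t0/0\n",
    "chr1\t10352\tTAC\tTAAC\t1/1\n",
    "chr1\t12783\tG\tA\t1/1\n",
]
print(count_genotypes(map(parse_query_line, sample_lines)))
```

Keeping the parsing separate from the subprocess call means the same parser also works on a saved output file or on standard input.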

4 months ago
Germany

See bcftools query.


EDIT: With bcftools query you can print any information you like. So in your case, e.g.:

$ bcftools query -f '%CHROM %POS  %REF  %ALT [ %GT]\n' input.vcf

The output now looks like this:

chr1 10177  ACC  ACCC  0/1
chr1 10327  T  C  0/0
chr1 10352  TAC  TAAC  1/1
chr1 12783  G  A  1/1

fin swimmer


I think this should be a comment, as it's more of a suggestion than a solution. See, for example, cpad's comment pointing to the same resource.


Hello Ram,

if an "answer" is only meant for a full copy-and-paste solution, then my post is indeed more of a comment. But I thought that naming the tool with its subcommand and linking to the good manual was answer enough.

I have now extended my post to a full answer :)

cpad was faster than me, right. I didn't see his answer, as I hadn't reloaded the page.

fin swimmer


Hi finswimmer! Can I ask for the trick to convert the output to symbolic genotypes? For your example:

chr1 10177  ACC  ACCC  ACC/ACCC
chr1 10327  T  C  T/T
chr1 10352  TAC  TAAC  TAAC/TAAC
chr1 12783  G  A  A/A

I searched for a while, but had no luck.


Hello yifangt86,

that's also described in the manual I've linked to:

$ bcftools query -f '%CHROM %POS  %REF  %ALT [ %TGT]\n' input.vcf

fin swimmer
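If the numeric genotypes have already been extracted and you want to translate them afterwards, the mapping is simple: index 0 is REF and indices 1, 2, ... are the comma-separated ALT alleles in order. Here is a rough Python sketch of that lookup; `translate_gt` is a hypothetical helper written for this illustration, not part of bcftools, and it is not guaranteed to match `%TGT` output in every corner case.

```python
import re


def translate_gt(gt, ref, alt):
    """Replace allele indices in a GT string with the allele sequences.

    Index 0 is REF; 1, 2, ... are the comma-separated ALT alleles.
    Separators ("/" or "|") and missing calls (".") are kept as they are.
    """
    alleles = [ref] + alt.split(",")
    parts = re.split(r"([/|])", gt)  # the capture group keeps the separators
    return "".join(
        alleles[int(part)] if part.isdigit() else part for part in parts
    )


# The four records from the example output earlier in this thread
rows = [
    ("ACC", "ACCC", "0/1"),
    ("T", "C", "0/0"),
    ("TAC", "TAAC", "1/1"),
    ("G", "A", "1/1"),
]
for ref, alt, gt in rows:
    print(ref, alt, translate_gt(gt, ref, alt))
```

For files where nothing has been extracted yet, letting bcftools do the translation with `%TGT` is the simpler route.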
