Taking genotypes out of a vcf file
2
0
Entering edit mode
8.7 years ago
Jautis ▴ 530

Hi, I would like to take a vcf file and output a tab-delimited file with a line per individual-site that includes the individual, site, GT, alternate and reference alleles, DP, and each PL. It doesn't matter what order sites end up being in.

Any suggestions on how to do this most efficiently and/or any links to code that does something like this?

Example input:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  samp1 samp2
chr1   100  .       C       T       3106.72 SnpCluster      .       GT:AD:DP:GQ:PL  0/0:1,0:1:3:0,3,42      0/0:3,0:3:9:0,9,132
chr1   120  .       C       G       3106.72 SnpCluster      .       GT:AD:DP:GQ:PL 0/1:3,1:4:30:30,0,123   1/1:0,1:1:3:45,3,0

Example output:

samp1    chr1    100    0/0   C   T   1    0   3   42
samp2    chr1    100   0/1    C   T   4   30   0   123
samp1    chr1    120    0/0   C   G   3    0    9    132
samp2    chr1    120    1/1   C   G   1    45   3    0
vcf • 3.5k views
ADD COMMENT
0
Entering edit mode

If you're into python, pyvcf makes this sort of task quite easy.

ADD REPLY
1
Entering edit mode
8.7 years ago

Using my tool Bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

file format.js

while(iter.hasNext())
    {
    var ctx = iter.next();
    for(var I = 0;i< ctx.getNSamples();++i)
        {
        var g = ctx.getGenotype(i);
        out.print(g.getSampleName());
        out.print("\t");
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
        out.print("\t");
        if(g.isCalled())
            {
            out.print(ctx.getAlleleIndex(g.getAllele(0))+"/"+ctx.getAlleleIndex(g.getAllele(1)));
            out.print("\t");
            out.print(g.getAllele(0).getDisplayString());
            out.print("\t");
            out.print(g.getAllele(1).getDisplayString());
            }
        else
            {
            out.print(".\t.\t.");
            }
        out.print("\t");
        out.print(g.getDP());
        out.print("\t");
        out.print(g.getGQ());
        out.print("\t");
        out.print(g.getAnyAttribute("NR"));
        out.println();
        }
    }

Invoke:

curl -sL "https://raw.githubusercontent.com/KBoehme/VTCTesting/master/src/test/resources/test_data/UnionTests/Test4/input2.vcf" | java -jar dist-1.133/bioalcidae.jar -F VCF -f format.js

Result:

NA00001    chr1    1451489    0/0    T    T    -1    35    18
NA00002    chr1    1451489    1/0    C    T    -1    100    24
NA00001    chr1    1451490    0/0    G    G    -1    35    18
NA00002    chr1    1451490    0/1    G    A    -1    100    24
NA00001    chr1    12939904    0/0    A    A    -1    45    53
NA00002    chr1    12939904    0/1    A    C    -1    100    79
NA00001    chr1    16902943    0/0    T    T    -1    26    93
NA00002    chr1    16902943    0/1    T    C    -1    84    98
NA00001    chr1    16902982    0/0    G    G    -1    52    108
NA00002    chr1    16902982    0/1    G    T    -1    75    115
(...)
ADD COMMENT
0
Entering edit mode

Thank you very much! That seems perfect. However, I'm having difficulties getting a working copy of bioalcidae.jar. When I download the .jar file, I get an error message saying it is invalid or corrupt. Is there anything else I need to do?

ADD REPLY
1
Entering edit mode

(?!) there is no jar fileto download, you need to install and compile the program ;

git clone "https://github.com/lindenb/jvarkit.git" && cd jvarkit &&  make bioalcidae && java -jar dist-1.133/bioalcidae.jar -h
ADD REPLY
0
Entering edit mode

Oh, thank you. I feel like an idiot now

ADD REPLY
0
Entering edit mode
8.7 years ago

I'm partial to the vawk tool. It's simple, but gets the job done quickly.

ADD COMMENT

Login before adding your answer.

Traffic: 2842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6