Question

annotating TCGA VCF file

1

7.7 years ago

jan ▴ 170

Hi,

I recently obtained TCGA VCF files to search for germline variants. The variants were called by Washington University using several callers (Samtools, Sniper, Varscan, and Strelka), and the separate call sets were lumped into one VCF file. On checking the files, most of the variants called by every caller except Varscan are uninformative, so I can only annotate the variants that were called by Varscan.

This is what the header line looks like:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCGA-PG-A914-01A-11D-A37N-09 TCGA-PG-A914-01A-11D-A37N-09-[Samtools] TCGA-PG-A914-10A-01D-A37N-09 TCGA-PG-A914-10A-01D-A37N-09-[Sniper] TCGA-PG-A914-01A-11D-A37N-09-[Sniper] TCGA-PG-A914-10A-01D-A37N-09-[VarscanSomatic] TCGA-PG-A914-01A-11D-A37N-09-[VarscanSomatic] TCGA-PG-A914-10A-01D-A37N-09-[Strelka] TCGA-PG-A914-01A-11D-A37N-09-[Strelka]

The problem is that the FORMAT column is not consistent. These are all the FORMAT strings found in the VCF files:

GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:FA:AD:FDP:SDP:SUBDP:AU:CU:GU:TU GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:IGT:DP4:BCOUNT:JGQ:AMQ:SSC:FDP:SDP:SUBDP:AU:CU:GU:TU GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:GQ:MQ:IGT:BCOUNT:JGQ:AMQ:SSC GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:DP4:FDP:SDP:SUBDP:AU:CU:GU:TU GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:FA GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:GQ:MQ GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:IGT:DP4:BCOUNT:JGQ:AMQ:SSC GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:FDP:SDP:SUBDP:AU:CU:GU:TU:DP4 GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:GQ:MQ:IGT:BCOUNT:JGQ:AMQ:SSC GT:DP:DP4:BQ:FA:VAQ:SS:FT:AD:FDP:SDP:SUBDP:AU:CU:GU:TU:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FA:FDP:SDP:SUBDP:AU:CU:GU:TU GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:FA GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ GT:DP:DP4:BQ:FA:VAQ:SS:FT:GQ:MQ:AD:IGT:BCOUNT:JGQ:AMQ:SSC GT:IGT:DP:DP4:BCOUNT:GQ:JGQ:VAQ:BQ:MQ:AMQ:SS:SSC:FT:AD:FA GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:GQ:MQ GT:DP:DP4:BQ:FA:VAQ:SS:FT:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC GT:AD:BQ:SS:DP:FDP:SDP:SUBDP:AU:CU:GU:TU:FT:DP4:FA:VAQ:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC GT:DP:DP4:BQ:FA:VAQ:SS:FT:IGT:BCOUNT:GQ:JGQ:MQ:AMQ:SSC:AD GT:GQ:DP:BQ:MQ:AD:FA:VAQ:SS:FT:DP4 GT:DP:DP4:BQ:FA:VAQ:SS:FT GT:DP:DP4:BQ:FA:VAQ:SS:FT:GQ:MQ:AD

I'm not sure what strategies exist for annotating this kind of VCF file, and I would really appreciate any help if you have encountered this kind of VCF formatting.

Disclaimers: 1) I have the right authorization to use the data 2) I have emailed TCGA regarding this issue and no solution was given 3) I have emailed Washington University a few weeks ago and haven't received any reply

TCGA VCF Whole exome sequencing germline • 3.7k views

ADD COMMENT • link updated 7.7 years ago by Chris Miller 22k • written 7.7 years ago by jan ▴ 170

0

How do you intend to annotate the VCF: with another program like snpEff or VEP, or with custom scripts? The former should not be a problem if the VCF is valid; for the latter, try a VCF parsing library like PyVCF, which will keep track of the FORMAT tags for you.
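For the custom-script route, the key point is that each record's FORMAT column names the per-sample values for that record only, so the order can differ from line to line without the file being invalid. Below is a minimal sketch of that idea using only the Python standard library rather than PyVCF; the function names and the toy header and records are made up for illustration and do not come from the TCGA files.

```python
def read_sample_names(header_line):
    """Return the sample column names from a '#CHROM ...' header line."""
    return header_line.rstrip("\n").split("\t")[9:]


def parse_record(line, sample_names):
    """Split one VCF data line into its fixed columns and per-sample dicts."""
    fields = line.rstrip("\n").split("\t")
    fixed = fields[:8]                    # CHROM .. INFO
    format_keys = fields[8].split(":")    # FORMAT is read per record
    samples = {}
    for name, raw in zip(sample_names, fields[9:]):
        # Pair each key with its value; zip stops at the shorter sequence,
        # so keys without a value are simply absent from the dict.
        samples[name] = dict(zip(format_keys, raw.split(":")))
    return fixed, samples


# Toy data (hypothetical): two records whose FORMAT order differs.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT",
                    "NORMAL-[VarscanSomatic]", "TUMOR-[VarscanSomatic]"])
rec1 = "\t".join(["1", "100", ".", "A", "G", ".", "PASS", ".",
                  "GT:DP:AD", "0/1:30:12", "0/1:40:18"])
rec2 = "\t".join(["1", "200", ".", "C", "T", ".", "PASS", ".",
                  "GT:AD:DP", "0/0:0:25", "0/1:9:33"])

names = read_sample_names(header)
for rec in (rec1, rec2):
    fixed, samples = parse_record(rec, names)
    for name in names:
        # Look values up by key, never by position.
        print(fixed[0], fixed[1], name, samples[name].get("DP"))
```

Looking values up by key (here `DP`) gives the same field regardless of where it sits in a given record's FORMAT string.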

ADD REPLY • link 7.7 years ago by Eric T. ★ 2.8k

score 1 · Answer 1 · 2016-07-27

1

7.7 years ago

Chris Miller 22k

I'm not sure who you emailed at WashU, but I can offer a partial response.

1) I'd recommend starting from the MAF files, rather than the VCFs, unless you're looking at WGS data. They are better curated lists of variants.

2) You can convert those to VCF using one of several available tools, such as https://github.com/mskcc/vcf2maf/blob/master/maf2vcf.pl

3) It's probably not an awful idea to reannotate using something like VEP or vcfanno.

ADD COMMENT • link 7.7 years ago by Chris Miller 22k

0

Thank you for your reply.

I just submitted a contact form through the McDonnell Genome Institute website.

It's quite difficult to navigate the new TCGA portal and there is no option to get MAF files for whole exome sequencing data. There are only BAM and VCF files. I found a website in this forum that explained where to get new MAF files.

https://wiki.nci.nih.gov/display/TCGA/TCGA+MAF+Files

However I could see that all the MAF files are for somatic variants, which is not useful for me if they don't contain germline variants. Also, the number of MAF files doesn't match the total number of cases (only ~half of total number of UCEC cases).

The VCF files will be annotated by the bioinformatics team at my institute using their own pipeline, which incorporates snpEff and other tools.

ADD REPLY • link 7.7 years ago by jan ▴ 170

0

Use vcf2maf in the repo that Chris pointed to. It was tested on a complex VCF similar to what you're dealing with. You can use it with the lumped VCF if you specify --tumor-id and --normal-id as the names of the genotype columns for VarScan.
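To find the two VarScan genotype column names without copying them by hand, something like the following sketch could be used. It assumes the tumor and normal columns can be told apart by the two-digit sample type code in the TCGA barcode (01 to 09 for tumor, 10 to 19 for normal); the function name `varscan_columns` is hypothetical, and the column list is taken from the header quoted in the question.

```python
import re


def varscan_columns(sample_names):
    """Return (tumor, normal) column names that carry the VarScan suffix."""
    tumor = normal = None
    for name in sample_names:
        if not name.endswith("-[VarscanSomatic]"):
            continue
        # Barcode layout assumed: TCGA-<site>-<participant>-<type code + vial>-...
        match = re.match(r"TCGA-[^-]+-[^-]+-(\d{2})", name)
        if match is None:
            continue
        code = int(match.group(1))
        if 1 <= code <= 9:        # tumor sample types
            tumor = name
        elif 10 <= code <= 19:    # normal sample types
            normal = name
    return tumor, normal


# Sample columns from the header in the question.
header_columns = [
    "TCGA-PG-A914-01A-11D-A37N-09",
    "TCGA-PG-A914-01A-11D-A37N-09-[Samtools]",
    "TCGA-PG-A914-10A-01D-A37N-09",
    "TCGA-PG-A914-10A-01D-A37N-09-[Sniper]",
    "TCGA-PG-A914-01A-11D-A37N-09-[Sniper]",
    "TCGA-PG-A914-10A-01D-A37N-09-[VarscanSomatic]",
    "TCGA-PG-A914-01A-11D-A37N-09-[VarscanSomatic]",
    "TCGA-PG-A914-10A-01D-A37N-09-[Strelka]",
    "TCGA-PG-A914-01A-11D-A37N-09-[Strelka]",
]

tumor_col, normal_col = varscan_columns(header_columns)
print("tumor column: ", tumor_col)
print("normal column:", normal_col)
```

The two printed names are the values to supply as the tumor and normal IDs when running vcf2maf on the lumped VCF.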

ADD REPLY • link 7.7 years ago by Cyriac Kandoth 6.0k