Format in vcf files in Geuvadis data not recognized
5.8 years ago
jamespower ▴ 100

Hi,

I have downloaded Geuvadis genotypes from:

https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr22.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

I am trying to create VCF files containing only the European samples using bcftools, but when I run

bcftools view --samples-file 373_sampleIDs.tab GEUVADIS.chr22.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

I get this error:

 [W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String
 [W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String
Undefined tags in the header, cannot proceed in the sample subset mode.

Trying to fix it with bcftools reheader does not work:

bcftools view -h file.vcf > header.txt; bcftools reheader -h header.txt file.vcf > fixed.vcf

I can find information on the FORMAT "PP", but not on the FORMAT "BD".

##FORMAT=<ID=PP,Number=G,Type=Integer,Description="Phred-scaled Posterior Genotype Probabilities">

Would anybody be able to help? Thanks!

samtools format vcf bcftools geuvadis • 2.4k views

W::

This is a warning, not an error. If the result is what you expect, then everything is okay.


Even though it is labelled as a warning, I cannot do certain operations such as extracting sample IDs (I have updated the output above to include the final error message).


Hello jamespower,

bcftools is very strict about header information. So if you want to use it, you must fix the header or use another program that is less strict, such as SnpSift.

The header line you found for PP declares Type=Integer, but the warning you posted shows bcftools falling back to Type=String because the tag is not defined in the header.

Have a look at the VCF specification and modify your header so that the missing FORMAT fields are defined with the right Type. Also make sure that the value for Number is correct.

If you show us the complete header of your current file and the first few variants, we can help you do this.

fin swimmer
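
To see which FORMAT fields are missing from a header before editing it, something like the following might help. It is only a sketch: `toy.vcf` and `undefined_tags.txt` are made-up names, and the single record is invented to mimic the PP/BD situation in the question. It compares the keys used in column 9 of each record against the `##FORMAT` lines in the header, using only awk.

```shell
# Build a tiny stand-in VCF (hypothetical data, tab-separated columns).
printf '%s\n' \
  '##fileformat=VCFv4.1' \
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' \
  > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> toy.vcf
printf '22\t100\t.\tA\tG\t.\tPASS\t.\tGT:PP:BD\t0|1:0,10,20:x\n' >> toy.vcf

# Collect the FORMAT IDs declared in the header, then report every key
# used in column 9 of a record that was never declared.
awk -F'\t' '
  /^##FORMAT=<ID=/ {
    id = $0
    sub(/^##FORMAT=<ID=/, "", id)   # drop the prefix
    sub(/,.*/, "", id)              # drop everything after the ID
    defined[id] = 1
    next
  }
  /^#/ { next }                     # skip remaining header lines
  {
    n = split($9, keys, ":")
    for (i = 1; i <= n; i++)
      if (!(keys[i] in defined)) missing[keys[i]] = 1
  }
  END { for (k in missing) print k }
' toy.vcf | sort > undefined_tags.txt

cat undefined_tags.txt
```

For the real file you would pipe `gunzip -c` output into the awk step instead of reading `toy.vcf`; each tag it lists needs its own `##FORMAT` line in the header.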

5.8 years ago

The following command seems to fix the error, but I have no idea what the descriptions of the new FORMAT fields should be, so they are left as "?". There are also some extra spaces in the INFO column, which the sed step replaces with underscores.

 wget -O - "https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr22.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz" |\
gunzip -c |\
awk '/^#CHROM/ {printf("##FORMAT=<ID=BD,Number=1,Type=String,Description=\"?\">\n##FORMAT=<ID=PP,Number=1,Type=String,Description=\"?\">\n");} {print}' |\
sed 's/ damaging/_damaging/g'  > fixed.vcf
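
If you would rather keep the file compressed and go through `bcftools reheader`, the same two lines could be spliced into the extracted header first. This is a rough sketch under assumptions: the `header.txt` built here is a made-up stand-in for the output of `bcftools view -h`, `fixed_header.txt` is a hypothetical name, and the `Number=1,Type=String,Description="?"` values are the same placeholders as in the command above, not the real definitions.

```shell
# Stand-in for: bcftools view -h file.vcf.gz > header.txt
printf '%s\n' \
  '##fileformat=VCFv4.1' \
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' \
  > header.txt
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n' >> header.txt

# Insert the missing FORMAT definitions just before the #CHROM line.
# Number/Type/Description are placeholders; adjust them once the real
# meaning of BD and PP is known.
awk '
  /^#CHROM/ {
    print "##FORMAT=<ID=BD,Number=1,Type=String,Description=\"?\">"
    print "##FORMAT=<ID=PP,Number=1,Type=String,Description=\"?\">"
  }
  { print }
' header.txt > fixed_header.txt

cat fixed_header.txt

# With bcftools installed, the edited header would then be applied with:
#   bcftools reheader -h fixed_header.txt -o fixed.vcf.gz file.vcf.gz
```

The reheader attempt in the question wrote the extracted header back unchanged, which is why the undefined tags were still missing afterwards; the edit step in between is what matters.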