Ensembl variant effect predictor fails if REF or ALT allele has a length around 0.5 million
13 months ago
mo.imranshah • 10

I have been using VEP (Variant Effect Predictor) from Ensembl to annotate VCFs produced by GATK's HaplotypeCaller and Pindel. VEP fails for some of the VCFs with the following error:

> -------------------- EXCEPTION --------------------
ERROR: Forked process(es) died: read-through of cross-process communication detected

>STACK Bio::EnsEMBL::VEP::Runner::_forked_buffer_to_output vep/version95/modules/Bio/EnsEMBL/VEP/
STACK Bio::EnsEMBL::VEP::Runner::next_output_line vep/version95/modules/Bio/EnsEMBL/VEP/
STACK Bio::EnsEMBL::VEP::Runner::run vep/version95/modules/Bio/EnsEMBL/VEP/
STACK toplevel vep/version95/vep:225
Date (localtime)    = Thu May  9 13:25:54 2019
Ensembl API version = 95 

It took me weeks to identify the actual cause of this error, as I was not able to find a solution on the forums. I tried adjusting the --buffer_size and --fork parameters as suggested on several forums, but without success. It turns out to be an issue with the size of the REF and ALT alleles for some variants. When I excluded the records with a REF or ALT allele longer than 1000 bases, I got results without any error.
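The exclusion step described above can be sketched roughly as follows. This is a minimal illustration, not the exact script I used: the function names `max_allele_length` and `split_by_allele_length` are made up for this example, the input is assumed to be an uncompressed, tab-delimited VCF, and the 1000-base cutoff is simply the value that happened to work for me.

```python
def max_allele_length(vcf_line):
    """Return the length of the longest REF or ALT allele on one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    # ALT may hold several comma-separated alleles.
    alleles = [ref] + alt.split(",")
    return max(len(allele) for allele in alleles)


def split_by_allele_length(lines, max_len=1000):
    """Split VCF lines into (kept, skipped) lists.

    Header lines (starting with '#') always go to `kept`.
    Data lines go to `skipped` when any allele is longer than `max_len`.
    Blank lines are dropped.
    """
    kept, skipped = [], []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("#") or max_allele_length(line) <= max_len:
            kept.append(line)
        else:
            skipped.append(line)
    return kept, skipped
```

The `kept` lines can be written to a new VCF and passed to VEP, while the `skipped` lines can be saved separately for inspection. Note that symbolic alleles such as `<DEL>` are short strings, so this check only catches records where the full sequence is spelled out.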

VEP offline command used is:

vep --buffer_size 1000 --offline -i dataset_22336.dat -o dataset_22337.dat --cache --dir vep/database/ --force_overwrite --merged --cache_version 95 --assembly GRCh38 --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa --fork 32 --everything --vcf

What would be a possible way to run VEP on records whose REF/ALT alleles are between 0.5 and 2 million bases long? Any help would be much appreciated.

Thanks in advance. Tagging @ Emily_Ensembl

Do you really want to annotate a variant of this length?

Pierre Lindenbaum, could you please suggest an optimal length cutoff for excluding insignificant variants?


3.7 years ago
Ben_Ensembl • 980

Hi mo.imranshah,

VEP has difficulty handling variants with long allele strings (>1000 bp) when fetching everything that overlaps the allele string, and this is probably what led to the fork failing.

We plan to look into this further to work out exactly what the 'upper limit' should be and how to handle these cases better.

However, it may be more efficient to upload your data into the Ensembl browser to visualise the genomic regions of interest:

or to use BioMart to retrieve the list of genes in the genomic regions of interest:
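For either route, the long-allele records first need to be turned into genomic regions. A minimal sketch of that conversion is below; the function name `vcf_record_to_region` is hypothetical, and it assumes the REF allele is written out in full, so that the region runs from POS to POS + len(REF) - 1 (VCF positions are 1-based). The exact region format expected by the browser or BioMart should be checked against their documentation.

```python
def vcf_record_to_region(vcf_line):
    """Convert one VCF data line into a 'chrom:start-end' region string.

    Assumes REF is spelled out in full; records that use symbolic
    alleles with an END tag in INFO are not handled here.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    end = pos + len(ref) - 1  # last reference base covered by REF
    return f"{chrom}:{pos}-{end}"
```

For example, a record at position 100 on chromosome 1 with REF `ACGT` gives the region `1:100-103`.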

Best wishes

Ben Ensembl Helpdesk

Thanks, Ben, for the quick reply. I hope to see better VEP performance in such cases. Meanwhile, I will go with your suggestions.

Best, Imran


