Question

Trouble running vcf2phyloviz.py

0

Entering edit mode

6.4 years ago

Thomas ▴ 160

Dear Biostars community,

My aim is to implement the following pipeline (albeit modified) in order to generate phylogenetic trees from whole-genome sequencing of multiple closely related bacterial samples:

http://userweb.eng.gla.ac.uk/cosmika.goswami/snp_calling/SNPCalling.html

However - I have a problem when I reach the stage of the pipeline where I use vcf2phyloviz.py (https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py). The aim of this script is to generate a pseudo-alignment of called SNPs from a merged VCF file e.g.

>ref
GCGGCNGTGGCGAGTGGCAGG
>Sample1
GCGGCATTTGCTGATGGTAGG
>Sample3
TTAATAGTGATGAGCTACGAA
>Sample2
GTAGCNGAGATGAGCTACGAA

The problem I am having is that logic in the vcf2phyloviz script seems to indicate that a SNP base position will not be appended to the alignment unless all sample sequences (contained in the VCF) have a nucleotide differing from the reference (i.e. all samples contain a SNP - though not all necessarily the same SNP).

This would force a psuedosequence of SNPs in which all nucleotides in the sample sequence would necessarily differ from the reference nucleotides

I am not sure if this is the intended behaviour of the script, or I am just running the script incorrectly. I have tried for many hours to identify the potential error that I could have made, but I can't seem to identify any.

The command that I run is as follows;

python2.7 vcf2phyloviz.py --metadata metadata.txt --output_prefix snps example.vcf

The first record of my input file looks like this (remaining header information omitted):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-1-1.A.fq.sam.bam.variant8 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-1-2.A.fq.sam.bam.variant4 bam/Escher
ichia_coli.HUSEC2011CHR1.dna.root-1-1.A.fq.sam.bam.variant10  bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-2-1.A.fq.sam.bam.variant12        bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1-2-2.A.fq.sam.bam.variant11        bam/
Escherichia_coli.HUSEC2011CHR1.dna.root-1-2.A.fq.sam.bam.variant9   bam/Escherichia_coli.HUSEC2011CHR1.dna.root-1.A.fq.sam.bam.variant14    bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-1-1.A.fq.sam.bam.variant3 bam/Escherichi
a_coli.HUSEC2011CHR1.dna.root-2-1-2.A.fq.sam.bam.variant6 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-1.A.fq.sam.bam.variant15  bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-2-1.A.fq.sam.bam.variant5 bam/Escherichia_coli.HUS
EC2011CHR1.dna.root-2-2-2.A.fq.sam.bam.variant2 bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2-2.A.fq.sam.bam.variant13  bam/Escherichia_coli.HUSEC2011CHR1.dna.root-2.A.fq.sam.bam.variant7     bam/Escherichia_coli.HUSEC2011CHR1
.dna.root.A.fq.sam.bam.variant

Chromosome      191     .       T       G       212.68  .       AB=0;ABP=0;AC=1;AF=1.00;AN=1;AO=10;CIGAR=1X;SDP=10;SDPB=10;SDPRA=0;EPP=3.87889;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=48.972;PAIRED=1;PAIRE
DR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=282;QR=0;RO=0;RPL=4;RPP=3.87889;RPPR=0;RPR=6;RUN=1;SAF=10;SAP=24.725;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp;set=variant3    GT:AO:SDP:SDPR:PL:QA:QR:RO      .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:
.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. 1:10:10:10,10:257,0:282:0:0     .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:. .:.:.:.:.:.:.:.

So for, example, this record would be omitted for the alignment because not all samples contains a SNPs (i.e. a nucleotide differing from the reference)

The logic in the vcf2phyloviz.py script which I think is casing the omission of these records are as follows:

for record in vcf_records:
        samples_to_use = [sample for sample in record.samples if sample.sample not in ignore_samples]
        bases = [sample.gt_bases for sample in samples_to_use if sample.gt_bases is not None]

        if len(bases) != len(samples_to_use):
            continue

The problem seems to be that if there is no entry for a particular sample of a given record - in my particular case - sample.gt_bases does not return any nucleotides - as a result, the subsequent test for a full base list fails.

So my question is as follows:

Is this the intended behaviour of the vcf2phyloviz.py script, or am I making some form of mistake?

Thanks,

NB: I have at the example case given in the URL I have given at the top of this post, and their example output they give for use of vcf2phyloviz indicates that this is not the expected behaviour of this script i.e. they produce alignments containing nucleotides which match the reference sequence

vcf2phyloviz vcf PyVCF • 1.6k views

ADD COMMENT • link 6.4 years ago by Thomas ▴ 160

score 1 · Accepted Answer · 2017-11-28

OK - so I think the problem here is that vcf2phyloviz.py is expecting data from each sample for each record.

I added the following workaround (at about line 75 in the script) in order to force the script to substitute the reference base for all samples lacking data (i.e. no SNP) for any given entry:

bases = []

for sample in samples_to_use:
    if sample.gt_bases is None:
        bases.append(record.REF)
    else:
        bases.append(sample.gt_bases)

NB: I also needed to add similar logic a few lines downstream in order to avoid errors, however, I have omitted this code for the sake of brevity