This is, I'm afraid, one of those open-ended questions where you'll get two different answers if you ask two different people.
From just looking at the VCF, it's next-to-impossible to relate quality to the entire sequencing run because a VCF may or may not be heavily filtered along the way. If you're lucky, and standard tools were used, then at least all filtering applied to the 'raw' VCF will be recorded in the VCF header, but even a 'raw' VCF may have been generated from a heavily filtered BAM/SAM and thus conceal much information about the overall run and quality of data.
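At a minimum, though, it's worth printing what the header does record. Here is a minimal sketch in Python (the path sample.vcf.gz is a hypothetical placeholder; ##FILTER definitions, ##source, and embedded command lines such as ##bcftools_* or ##GATKCommandLine are the usual breadcrumbs, if they were kept):

```python
# Minimal sketch: list what processing a VCF records about itself.
# 'sample.vcf.gz' is a hypothetical path; works on plain or gzip/bgzip VCFs.
import gzip

def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open
    return opener(path, "rt")

with open_vcf("sample.vcf.gz") as vcf:
    for line in vcf:
        if not line.startswith("##"):
            break  # past the meta-information header
        # FILTER definitions and embedded command lines hint at prior processing
        if line.startswith(("##FILTER", "##bcftools", "##GATKCommandLine", "##source")):
            print(line.rstrip())
```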
For example, I've been working from assumptions like:
- If the BAM file, or even the VCF/BCF files, have a very low or a very broad curve for the DEPTH, the entire data file may be invalid and throwing it out should be considered.
- If the INDEL lengths, QUAL, or GQ do not fall in a relatively normal bell curve, the data may be invalid.
'Very broad curve' is vague but I guess that you mean generally uneven depth of coverage? It would be incorrect to automatically assume that a sample was poor quality by just looking at this. The depth of coverage profile can be influenced by one or more of the following:
- target depth of coverage (obvious)
- difficulty in priming due to high GC content
- sequence similarity [to other regions of the genome]
- outdated reagents
- degraded DNA
- delays in the wet-laboratory processing of the sample
- et cetera.
Thus, there are many 'parameters' that go into the depth of coverage 'equation', and I believe that some variation in depth of coverage is to be expected. You haven't elaborated on whether the variation you're observing is extreme or merely this expected unevenness.
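If you want to quantify that variation rather than eyeball it, one rough approach is to bin per-base depth and look at the spread of the bin means. A minimal sketch, assuming depth.txt (a hypothetical file name) holds `samtools depth -a` output and using an arbitrary 50 kb bin size:

```python
# Minimal sketch: summarise depth-of-coverage variation in fixed-size bins.
# Assumes 'depth.txt' (hypothetical) holds `samtools depth -a` output:
# three tab-separated columns: chrom, 1-based position, depth.
from collections import defaultdict

BIN_SIZE = 50_000  # bin width in bp (illustrative)

sums = defaultdict(int)    # (chrom, bin_index) -> summed depth
counts = defaultdict(int)  # (chrom, bin_index) -> positions seen

with open("depth.txt") as fh:
    for line in fh:
        chrom, pos, depth = line.rstrip("\n").split("\t")
        key = (chrom, (int(pos) - 1) // BIN_SIZE)
        sums[key] += int(depth)
        counts[key] += 1

# Mean depth per bin; a very ragged profile will show up immediately here
for (chrom, idx) in sorted(sums):
    mean = sums[(chrom, idx)] / counts[(chrom, idx)]
    print(f"{chrom}\t{idx * BIN_SIZE + 1}-{(idx + 1) * BIN_SIZE}\t{mean:.1f}")
```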
I'm not sure that the indel length profile should necessarily follow a bell curve, nor that the QUAL or GQ distributions should. A lot goes indirectly into the calculation of QUAL and GQ, and sometimes the assigned values don't even make sense.
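Rather than assuming any particular shape, it's straightforward to summarise the actual distributions and look at them. A minimal sketch for a single-sample VCF, with sample.vcf.gz again a hypothetical input:

```python
# Minimal sketch: summarise QUAL, site DP, GQ, and indel-length distributions
# from a single-sample VCF, rather than assuming they are bell-shaped.
# 'sample.vcf.gz' is a hypothetical path; plain or gzipped VCFs both work.
import gzip
import statistics

def open_vcf(path):
    with open(path, "rb") as fh:
        gz = fh.read(2) == b"\x1f\x8b"
    return (gzip.open if gz else open)(path, "rt")

quals, dps, gqs, indel_lens = [], [], [], []
with open_vcf("sample.vcf.gz") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[5] != ".":
            quals.append(float(fields[5]))  # QUAL column
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        if "DP" in info:
            dps.append(int(info["DP"]))     # site depth from INFO
        for alt in fields[4].split(","):    # signed indel length: +ins / -del
            if alt in (".", "*") or alt.startswith("<"):
                continue
            if len(alt) != len(fields[3]):
                indel_lens.append(len(alt) - len(fields[3]))
        if len(fields) > 9:                 # FORMAT + first sample
            fmt = fields[8].split(":")
            if "GQ" in fmt:
                gq = fields[9].split(":")[fmt.index("GQ")]
                if gq != ".":
                    gqs.append(int(gq))

for name, vals in [("QUAL", quals), ("DP", dps), ("GQ", gqs), ("indel len", indel_lens)]:
    if vals:
        print(name, "n =", len(vals), "median =", statistics.median(vals),
              "min =", min(vals), "max =", max(vals))
```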
The sense of 'quality' for a sample and run is more a human feeling that should come from looking at a whole host of parameters. In order to make an honest decision on whether a run failed or not, I would love to see:
Wet lab
- DNA concentration
- Gel electrophoresis of DNA
- Length of time DNA was in transit
- Date the reagent kit was produced (and its expiry date)
Sequencing
Bioinformatics
- All programs and versions used to process the data, including the base-caller in the sequencer
- Total reads
- Min/Max/Mean/Median/Upper-/Lower-quartile read length
- Any QC or trimming applied to reads
- Alignment % to the reference genome
- Genome version used for alignment
- Mate-pairs mapped together
- Reads aligned to >1 location
- Singletons / lone mates
Bioinformatics coverage and other QC
- Number of reads off target (targeted sequencing only)
- Min/Max/Mean/Median/Upper-/Lower-quartile read depth per chromosome and genome-wide
- Plot of the depth-of-coverage profile in bins (e.g. 50,000 bp) per chromosome and genome-wide
- Bases with 0, <5, et cetera read depth (then summarised into regions that have the same read depth at each level)
- Overall % of the genome covered at read depth 1, 2, 3, 4, 5, 10, 18, 20, 30, et cetera (see the sketch after this list)
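As a rough illustration of that last point, coverage breadth at each threshold can be computed from the same `samtools depth -a` output as before; depth.txt is again a hypothetical file name and the thresholds simply mirror the list above:

```python
# Minimal sketch: % of reference positions covered at several read-depth
# thresholds. Assumes 'depth.txt' (hypothetical) is `samtools depth -a`
# output (chrom, pos, depth), so every reference position appears once.
THRESHOLDS = [1, 2, 3, 4, 5, 10, 18, 20, 30]

total = 0
at_least = {t: 0 for t in THRESHOLDS}

with open("depth.txt") as fh:
    for line in fh:
        depth = int(line.rsplit("\t", 1)[1])
        total += 1
        for t in THRESHOLDS:
            if depth >= t:
                at_least[t] += 1

if total:
    for t in THRESHOLDS:
        print(f">= {t}x: {100 * at_least[t] / total:.2f}% of positions")
```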
Variant calling (to produce a VCF)
- MAPQ filters
- MAPQ bias
- Phred-scaled base quality filters
- Base quality bias
- Strand bias
- Downsampling performed (to what level?)
- Min number of variant bases required to make a variant call
- Min allelic fraction at which to call a heterozygous/homozygous variant
- Read-end bias
- Min total (ref+alt) read depth at which a variant is even reported in the VCF (a toy filtering sketch follows this list)
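To make a few of those thresholds concrete, here is a toy sketch of applying hard filters to a single-sample VCF. The cut-offs and the input path are illustrative assumptions, not recommendations; in practice something like bcftools with -i/-e expressions would be the usual route, but the sketch shows the arithmetic:

```python
# Toy sketch of hard-filtering a single-sample VCF on site quality,
# total (ref+alt) depth, and allelic fraction. All thresholds and the
# path 'sample.vcf.gz' are illustrative assumptions.
import gzip

MIN_QUAL = 30.0      # Phred-scaled site quality (assumed cut-off)
MIN_DP = 10          # min total (ref+alt) read depth (assumed cut-off)
MIN_ALT_FRAC = 0.20  # min allelic fraction for a het call (assumed cut-off)

def open_vcf(path):
    with open(path, "rb") as fh:
        gz = fh.read(2) == b"\x1f\x8b"
    return (gzip.open if gz else open)(path, "rt")

kept = dropped = 0
with open_vcf("sample.vcf.gz") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        fmt = f[8].split(":")
        sample = dict(zip(fmt, f[9].split(":")))
        qual_ok = f[5] != "." and float(f[5]) >= MIN_QUAL
        dp_ok = sample.get("DP", ".") != "." and int(sample["DP"]) >= MIN_DP
        frac_ok = True
        if "," in sample.get("AD", ""):  # AD = ref,alt[,alt...] read counts
            counts = [int(x) for x in sample["AD"].split(",") if x != "."]
            if len(counts) >= 2 and sum(counts) > 0:
                frac_ok = max(counts[1:]) / sum(counts) >= MIN_ALT_FRAC
        if qual_ok and dp_ok and frac_ok:
            kept += 1
        else:
            dropped += 1
print(f"kept {kept}, dropped {dropped}")
```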
This is just off the top of my head; the full list is practically endless...
Kevin
Adding to Kevin's comment: an unbiased evaluation would require large-scale confirmation by a second sequencing approach, e.g. traditional Sanger sequencing, but this is of course not feasible for most researchers, especially if variant calling is only one of many tasks in the project (and it is only possible if you created the data yourself, i.e. have the DNA in your freezer, rather than having downloaded it from a database). Therefore, you'll need to rely on the variant caller's recommended settings (and only change the defaults if you have expert knowledge), as these should be derived from exactly such extensive validation efforts. Are you working on human data?
Thanks for the reply. I'm working on getting some of the data Kevin recommended. To answer your questions:
- This is data from another lab (we don't have the DNA in our freezer).
- It is human data (whole genomes from 70 individuals).
- The tools were run (to the best of my knowledge) using the "standard settings", although even finding out what the standard recommendations are seems a bit hard.
I'll include some of this info in my reply to Kevin :)
Yes, because there is no standardisation in either bioinformatics or NGS. They must mean 'standard' in terms of their own in-house laboratory settings.