Question

How many variants should I expect in a .vcf file?


6.5 years ago

Adam ▴ 40

Dear Biologists and Bio-IT-specialists,

Based on your knowledge and experience, I'd like to know how many variants should be discovered from a .BAM file (final result, after annotation and filtering). This question is very general, so I'll make it more specific.

Let's take some NGS data (FASTQ) from the ENA database. I performed the whole analysis, and here are my results:

WES data, paired-end, Illumina HiSeq 2000; material: cancer cell line. Software that I used:

FastQC - quality control (before and after adapter trimming)
Trimmomatic - adapter trimming
Subread - mapping to the genome (hg19)
Bamtools - filter, sort, index
FreeBayes - variant discovery
SnpSift and SnpEff - filtering and annotation of the discovered variants

After annotating the variants, I ran:

(QUAL > 1) & (QUAL / AO > 10) & (SAF > 0) & (SAR > 0) & (RPR > 1) & (RPL > 1)

as the filter expression for SnpSift. As a result, I have a .vcf file with 32866 variants. Do you think this number is plausible for the criteria above?
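For readers who want to check such a count independently, here is a minimal Python sketch that applies the same thresholds to plain-text VCF lines and counts the records that pass. It is not how SnpSift evaluates the expression internally. The helper names (`parse_info`, `first_number`, `passes_filter`, `count_passing`) are made up for illustration, and the sketch assumes FreeBayes-style INFO fields (AO, SAF, SAR, RPR, RPL). For multi-allelic records it looks only at the first ALT allele's value, which is a simplification.

```python
def parse_info(info_field):
    """Turn 'AO=12;SAF=5' into a dict; flag-only entries map to True."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        elif entry:
            info[entry] = True
    return info


def first_number(info, key):
    """First comma-separated value of an INFO field as a float.

    Simplification: per-allele fields of multi-allelic records are
    represented by the first ALT allele only.
    """
    return float(str(info[key]).split(",")[0])


def passes_filter(vcf_line):
    """Apply QUAL > 1, QUAL / AO > 10, SAF > 0, SAR > 0, RPR > 1, RPL > 1."""
    try:
        fields = vcf_line.rstrip("\n").split("\t")
        qual = float(fields[5])
        info = parse_info(fields[7])
        ao = first_number(info, "AO")
        return (
            qual > 1
            and ao > 0  # guards the division below
            and qual / ao > 10
            and first_number(info, "SAF") > 0
            and first_number(info, "SAR") > 0
            and first_number(info, "RPR") > 1
            and first_number(info, "RPL") > 1
        )
    except (IndexError, KeyError, ValueError):
        # Missing QUAL ('.'), missing INFO keys, or malformed lines fail the filter.
        return False


def count_passing(lines):
    """Count non-header records that pass the filter."""
    return sum(
        1
        for line in lines
        if line.strip() and not line.startswith("#") and passes_filter(line)
    )
```

Calling `count_passing(open("annotated.vcf"))` on an uncompressed file (the file name is a placeholder) gives a number to compare against the SnpSift output. Small differences are expected at multi-allelic sites because of the simplification above.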

Why am I asking this?

A few days ago I received several ready-to-use .vcf files from a WES experiment on a different cancer, this time from patients. The analysis was performed by a specialist I don't know. I was a bit confused because each .vcf file contains 300000-350000 variants with the filter status PASS. To be honest, I don't know the pipeline that produced these results, except that variant discovery was performed with SAMtools.

So my questions are:

Do you think that my analysis and the tools I used are appropriate? I'd like to avoid false negatives, where a mutation is present in the sample but absent from my resulting .vcf file.

How many variants do you usually get as the final result of a WES analysis?

Best regards,

Adam


updated 6.5 years ago by ATpoint 82k • written 6.5 years ago by Adam ▴ 40

score 5 · Accepted Answer · 2017-10-23

The total number of variants really depends on the sample. Cell lines have often been in culture for decades and may have acquired a plethora of mutations over the years that have nothing to do with the underlying disease. IMHO, the only use of cell line variants is to check whether the line is a suitable model for experiments, i.e. whether mutations found in patients are also present in the line.

The number of variants also depends on the disease: some cancer types, such as AML, have low mutation rates, while others, such as lymphoma and most skin and lung cancers, have high mutation rates. That said, whatever the numbers in your cell line are, I don't think they are representative of anything.

Also keep in mind that most of these variants are probably germline variants, which you cannot distinguish from somatic ones because you lack a matched normal control. What you can do is filter out variants that are common in the human population, e.g. by removing all variants reported by the 1000 Genomes Project. The Ensembl Variant Effect Predictor (VEP) can do that for you. But as I said, cell line mutations must be interpreted carefully and cannot replace primary material.
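The idea of removing variants already reported in a population resource can be illustrated with a small Python sketch. This is not what VEP does internally. It is only a plain set lookup on (chromosome, position, REF, ALT), and the function names `variant_keys` and `remove_known_variants` are hypothetical. It assumes both inputs are uncompressed VCF text that uses the same reference build, the same chromosome naming ("chr1" vs "1") and the same allele representation. Indels in particular may need normalization first.

```python
def variant_keys(vcf_lines):
    """Yield (chrom, pos, ref, alt) for every ALT allele of every record."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield (chrom, int(pos), ref, alt)


def remove_known_variants(sample_lines, known_lines):
    """Return sample records that have at least one ALT allele not in the known set.

    A record is dropped only when all of its ALT alleles are already
    reported in the known-variant file.
    """
    known = set(variant_keys(known_lines))
    kept = []
    for line in sample_lines:
        if not line.strip() or line.startswith("#"):
            continue
        keys = list(variant_keys([line]))
        if any(key not in known for key in keys):
            kept.append(line)
    return kept
```

Without a matched normal, the records that remain are still a mix of rare germline and somatic variants, so the result should be read as "not commonly reported" rather than "somatic".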