Question

Prioritize structural variant (SV) calls supported by multiple tools

0

8.0 years ago

michealsmith ▴ 790

How do people usually filter structural variants further after calling them with multiple tools (Pindel, CNVnator, Delly, BreakDancer, Genome STRiP, you name it)?

I'd like to further prioritize calls supported (at 70% reciprocal overlap) by at least two tools. Does that sound reasonable? Intuitively I think so, because SV identification is still very noisy and technically much more challenging than SNP calling. I have also seen a paper that did the same (http://www.nature.com/articles/srep18501). But if I'm correct, the 1000 Genomes Project "merged" the output of 9 algorithms to come up with a union.

I proposed the above to my professors (they are geneticists, not bioinformaticians, and have basically zero knowledge of NGS-based SV calling), and they thought "supported by at least two tools" and "70% reciprocal overlap" were simply too arbitrary. How would I explain this and convince them? Personally, I think many procedures in SV calling and filtering are indeed very arbitrary, given the complex nature of structural variation and the challenges of NGS-based calling.
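To make the proposed rule concrete, here is a minimal sketch of "supported by at least two callers at 70% reciprocal overlap". The `SV` record, the coordinates, and the function names are hypothetical and only for illustration; the sketch assumes every call has `end > start`, so zero-length insertions would need separate handling.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive, assumed > start
    svtype: str   # e.g. "DEL" or "DUP"
    caller: str


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if no overlap)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def supported_calls(calls, min_ro=0.7, min_callers=2):
    """Keep calls that are matched by calls from enough distinct callers."""
    kept = []
    for a in calls:
        callers = {a.caller}
        for b in calls:
            if b.caller != a.caller and reciprocal_overlap(a, b) >= min_ro:
                callers.add(b.caller)
        if len(callers) >= min_callers:
            kept.append(a)
    return kept


# Toy calls (made-up coordinates, not real data)
calls = [
    SV("chr1", 1000, 2000, "DEL", "delly"),
    SV("chr1", 1100, 2050, "DEL", "breakdancer"),
    SV("chr1", 5000, 9000, "DEL", "cnvnator"),
    SV("chr1", 5000, 6000, "DEL", "delly"),
]
for sv in supported_calls(calls):
    print(sv)
```

In this toy input the first two deletions support each other, while the 4 kb read-depth call and the 1 kb paired-end call at chr1:5000 do not, because the smaller call covers only a quarter of the larger one.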

SV structural variation intersect • 2.6k views

ADD COMMENT • link updated 7.9 years ago by QVINTVS_FABIVS_MAXIMVS ★ 2.5k • written 8.0 years ago by michealsmith ▴ 790

1

Here are some thoughts about this:

(1) The most accurate consensus approach is probably merging across sequencing technologies, such as merging an Illumina and a PacBio call set.

(2) If you have only Illumina data available, then calls reported by multiple orthogonal methods should be better (e.g., merging a read-depth call set with a paired-end call set instead of merging 2 paired-end call sets).

(3) Assuming you are doing germline SV calling in a larger cohort, then (a) create a separate population call set using each caller, (b) assess the FDR of each call set, and (c) merge only call sets that pass a pre-defined FDR threshold.

(4) I actually do agree with your professors that current merging procedures are ad hoc and not satisfactory, but they are still commonly applied because they appear to show an accuracy advantage (e.g., see MetaSV, PMID: 25861968). The merging itself can possibly be improved: (a) require a reciprocal overlap threshold and a maximum breakpoint offset; (b) for common SVs, additionally require genotype concordance; (c) try a repeat-aware merging of SVs (see PMID: 25979471).
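As a rough illustration of point (4a), the sketch below combines a reciprocal overlap threshold with a maximum breakpoint offset. The function name and the default values (`min_ro=0.7`, `max_offset=250` bp) are assumptions chosen for the example, not recommended settings.

```python
def is_same_sv(a, b, min_ro=0.7, max_offset=250):
    """Decide whether two calls describe the same SV.

    a, b: (chrom, start, end) tuples with end > start.
    Both constraints must hold: reciprocal overlap >= min_ro and
    each breakpoint within max_offset bp of its counterpart.
    """
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return False
    ro = min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))
    offsets_ok = abs(a[1] - b[1]) <= max_offset and abs(a[2] - b[2]) <= max_offset
    return ro >= min_ro and offsets_ok


# Large SVs: overlap alone passes (0.76), but the end breakpoints differ by 12 kb.
print(is_same_sv(("chr1", 10_000, 60_000), ("chr1", 10_000, 48_000)))  # False

# Small SVs: breakpoints are within 100 bp, but reciprocal overlap is only ~0.67.
print(is_same_sv(("chr1", 1_000, 1_300), ("chr1", 1_100, 1_400)))      # False

# Both constraints satisfied.
print(is_same_sv(("chr1", 1_000, 2_000), ("chr1", 1_050, 2_040)))      # True
```

The two constraints complement each other: reciprocal overlap alone is lenient for large events, and a breakpoint offset alone is lenient for small ones.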

ADD REPLY • link 8.0 years ago by trausch ★ 1.9k

0

Many thanks!

(1) Merging an Illumina and a PacBio call set: by "merge", do you mean intersection or union?

(2) I only have Illumina data. Again, for "merging a read-depth call set with a paired-end call set", do you mean intersection or union? Also, I ran CNVnator as the read-depth caller, but I found that the Delly/BreakDancer calls share very little intersection with the CNVnator calls. (I did pre-filter the CNVnator output with a p-value threshold, though.)

(3) I'm not doing germline SV calling in tumors, but similarly I'm trying to find rare/novel SVs for a neurological disease by filtering out common deletions from the 1000 Genomes Project. By the way, how do you assess FDR? Using well-sequenced samples like NA12878? Or do you mean validating SVs in my own samples using arrays/PCR?

ADD REPLY • link 8.0 years ago by michealsmith ▴ 790

0

For (1) and (2) I meant the intersection, using a reciprocal overlap threshold and a maximum breakpoint offset as constraints. For (3), an FDR estimate using your own cohort of samples is better than using "only" NA12878, but it is not always feasible, of course.
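For step (3), one simple way to picture the FDR check is to validate a subset of calls per caller (for example by PCR or array) and compute the fraction that fail. The sketch below assumes such validation results exist; the caller names, counts, and the 20% cutoff are made up for illustration.

```python
def estimate_fdr(validated):
    """Estimate FDR as failed validations / all validated calls.

    validated: list of bools, True if the call was confirmed
    by an orthogonal assay.
    """
    if not validated:
        raise ValueError("need at least one validated call")
    false_positives = sum(1 for confirmed in validated if not confirmed)
    return false_positives / len(validated)


# Hypothetical validation outcomes per caller (not real data)
validation = {
    "caller_A": [True] * 18 + [False] * 2,   # 2 of 20 failed
    "caller_B": [True] * 12 + [False] * 8,   # 8 of 20 failed
}

max_fdr = 0.2  # assumed pre-defined threshold
passing = [
    name for name, results in validation.items()
    if estimate_fdr(results) <= max_fdr
]
print(passing)
```

Only the call sets listed in `passing` would then go into the merge.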

ADD REPLY • link 8.0 years ago by trausch ★ 1.9k

score 1 · Answer 1 · 2016-04-11

1

8.0 years ago

harold.smith.tarheel ★ 4.9k

The answer depends on your objective. Intersection would improve precision (fewer false positives), but with a significant loss in sensitivity (fewer true positives). Union would have the opposite effect. Most of the tools are relatively insensitive, so union/ensemble calling is typical (e.g., Brad Chapman's comparative analysis here nicely illustrates the point).
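The trade-off can be seen on a toy example. The sketch below assumes calls from two callers have already been matched to shared identifiers and that a truth set is available; all identifiers are invented.

```python
def precision_recall(calls, truth):
    """Return (precision, recall) of a call set against a truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data: six true SVs; each caller finds three of them plus one false positive.
truth = {"sv1", "sv2", "sv3", "sv4", "sv5", "sv6"}
caller_a = {"sv1", "sv2", "sv3", "fp1"}
caller_b = {"sv2", "sv3", "sv4", "fp2"}

union_p, union_r = precision_recall(caller_a | caller_b, truth)
inter_p, inter_r = precision_recall(caller_a & caller_b, truth)

print(f"union:        precision={union_p:.2f} recall={union_r:.2f}")
print(f"intersection: precision={inter_p:.2f} recall={inter_r:.2f}")
```

Here the intersection removes both false positives but recovers only two of the six true SVs, while the union recovers four at the cost of keeping both false positives.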

ADD COMMENT • link 8.0 years ago by harold.smith.tarheel ★ 4.9k

0

Many thanks. I have always thought precision is the bigger challenge, given the many repetitive sequences that short reads cannot resolve. But the problem is that even when I intersect the call sets and manually check the local BAM file in IGV for each SV, I still find lots of obvious false positives. I can't imagine how many false positives I would get with a union.

By the way, I'm trying to identify rare or novel SVs by filtering out common ones from the 1000 Genomes Project.
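The filtering step I have in mind looks roughly like the sketch below: drop any call that reciprocally overlaps a catalog SV whose allele frequency is above a cutoff. The catalog entries, the 50% overlap, and the 1% frequency cutoff are placeholders for illustration, not values taken from the 1000 Genomes release.

```python
def reciprocal_overlap(a, b):
    """a, b: (chrom, start, end) tuples with end > start."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def drop_common(calls, catalog, min_ro=0.5, max_af=0.01):
    """Remove calls that match a catalog SV with allele frequency > max_af.

    catalog: list of ((chrom, start, end), allele_frequency) pairs.
    """
    rare = []
    for call in calls:
        is_common = any(
            af > max_af and reciprocal_overlap(call, region) >= min_ro
            for region, af in catalog
        )
        if not is_common:
            rare.append(call)
    return rare


# Placeholder catalog and calls (made-up coordinates and frequencies)
catalog = [
    (("chr2", 10_000, 15_000), 0.12),    # common deletion
    (("chr2", 40_000, 42_000), 0.002),   # rare in the catalog
]
calls = [
    ("chr2", 10_100, 15_050),   # matches the common deletion
    ("chr2", 40_000, 42_000),   # matches only a rare catalog entry
    ("chr5", 700, 1_900),       # not in the catalog
]
rare_calls = drop_common(calls, catalog)
print(rare_calls)
```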

ADD REPLY • link 8.0 years ago by michealsmith ▴ 790

score 0 · Answer 2 · 2016-05-10

For CNVs, you can use my machine learning algorithm, which assigns copy number (and genotype likelihoods) to deletions and duplications.

gtCNV

A paper detailing the method is in the works. Our group uses two methods to filter CNVs (complex variation is a bit tricky, and we used some heuristics). SVtyper is a good method, but it is slow. We also used my algorithm, which is less refined for duplications and blind to complex events, but much faster and orthogonal to SVtyper.

As people mentioned, the general method of merging and filtering is flawed. Some SV callers are more sensitive to NAHR events that might otherwise be missed. gtCNV attempts to solve this issue by accepting an unfiltered list of CNVs and assigning a genotype likelihood to each variant.

Then you can start applying filters. My code also annotates the resulting VCF with 1000 Genomes phase 3 CNV positions, genes, repeats, and MEIs.

Caveats: NAHR duplications are hard (but I'm working on that now!), and any locus with a copy number >10 cannot be genotyped (so there is some filtering).

For details on how we prioritize CNVs and complex variants, check out our paper, Frequency and Complexity of De Novo Structural Mutation in Autism.