Variant Calling in Low-Complexity Regions
6.5 years ago by ATpoint 81k

I am currently analyzing somatic SNVs in matched tumor-normal samples, and I would like to hear your opinions on filtering variants in low-complexity (LC) regions. It has been shown previously that LC regions are hotspots for SNP and especially indel calls, potentially due to PCR amplification errors in long homopolymer stretches and to alignment errors.

My pipeline so far uses BWA-MEM for alignment to hg38, followed by mpileup/VarScan2, the VarScan2 false-positive filter, and removal of variants annotated in the 1000 Genomes project. The datasets were downloaded, so we have no possibility to confirm any of the variants ourselves.

Therefore I would like to ask for your opinion and experience on how reliable somatic SNVs in LC regions are (LC regions obtained by running Heng Li's sdust implementation on the hg38 fasta). There are some reports, e.g. from bcbio, that categorically exclude LC variants, but I really wonder about the false-negative rate this introduces. This is especially important because some of the regions (both coding and non-coding) we are interested in lie (almost) entirely within LC regions, so I would have a hard time excluding them categorically. In the end we will have to confirm the interesting regions by targeted sequencing of a patient cohort, but for now I would appreciate your opinions.
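To make the setup concrete, here is a minimal sketch of how one could flag (rather than discard) calls that overlap the sdust-derived LC intervals. The file names and the flag-only strategy are illustrative assumptions for this example, not anything prescribed by VarScan2 or bcbio.

```python
#!/usr/bin/env python3
# Minimal sketch: flag (rather than drop) VCF records that fall inside
# low-complexity intervals from an sdust-derived BED file.
# File names ("lc_regions.bed", "somatic.vcf") are placeholders.
import sys
from bisect import bisect_right
from collections import defaultdict

def load_bed(path):
    """Load BED intervals as {chrom: sorted list of (start, end)}, 0-based half-open."""
    intervals = defaultdict(list)
    with open(path) as bed:
        for line in bed:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            intervals[chrom].append((int(start), int(end)))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals

def overlaps(intervals, chrom, pos0):
    """True if the 0-based position pos0 lies inside any interval on chrom."""
    ivs = intervals.get(chrom, [])
    i = bisect_right(ivs, (pos0, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos0 < ivs[i][1]

def main(bed_path, vcf_path):
    lc = load_bed(bed_path)
    for line in open(vcf_path):
        if line.startswith("#"):
            sys.stdout.write(line)
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos0 = fields[0], int(fields[1]) - 1  # VCF positions are 1-based
        if overlaps(lc, chrom, pos0):
            # annotate instead of discarding, so the call can still be reviewed later
            fields[6] = "LowComplexity" if fields[6] in (".", "PASS") else fields[6] + ";LowComplexity"
        sys.stdout.write("\t".join(fields) + "\n")

if __name__ == "__main__":
    main("lc_regions.bed", "somatic.vcf")
```

This assumes the sdust intervals are non-overlapping (which they are after merging); the point is only that flagging keeps the information without committing to a categorical exclusion.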

Variants SNP Low-Complexity Indel Varscan

Short read mapping is much less accurate in low-complexity areas. That said, short reads have gotten longer over the years, so be cautious of following procedures that were developed for reads that are substantially shorter than what you are using. It's interesting that the bcbio link you gave never mentioned read lengths or insert sizes, both of which are crucial in determining the reliability of mapping in low-complexity areas.


Thanks for the comment. In our case it is 2x100 bp from a GAIIx, with an insert size of about 400 bp. Is there a very crude rule of thumb, like "if the filtering was developed for 2x50 but we have 2x100 or more, we would over-filter"? As there most probably isn't a simple rule of thumb, can you direct me to good literature on the matter, or toss out some keywords I can follow up on?


Hi, I use the same pipeline (I mean BWA and VarScan2) for the same kind of tumor analysis (a panel targeting 21 genes). We sequence on a Life Technologies PGM and see a lot of sequencing errors in homopolymers and LC regions (and I once saw a problem caused by a pseudogene). At the beginning I worried about false negatives like you, but after more than 1,000 samples we see, in most cases, clear differences in strand bias plus supporting-read counts and frequency. For example, in 90% of samples we see a recurrent call such as a TTT insertion where the position is covered at around 2000X but on average has only about 20 supporting reads on the forward and reverse strands each; for a true mutation (I mean confirmed by Sanger) we get more like 200 mutated reads on each strand. I think the only way to filter false positives is to use statistics across the sample population.
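As a rough illustration of that per-strand support check, the sketch below encodes it with made-up thresholds (minimum reads per strand, minimum allele frequency); the numbers are examples only, not validated cutoffs.

```python
# Rough sketch of the per-strand support check described above.
# The thresholds (min reads per strand, min allele frequency) are
# made-up example values, not validated cutoffs.

def passes_support_check(fwd_alt, rev_alt, depth,
                         min_per_strand=50, min_vaf=0.05):
    """Require supporting reads on both strands and a minimum variant allele frequency."""
    vaf = (fwd_alt + rev_alt) / depth if depth else 0.0
    return fwd_alt >= min_per_strand and rev_alt >= min_per_strand and vaf >= min_vaf

# The two situations from the reply above:
print(passes_support_check(20, 20, 2000))    # recurrent TTT artifact  -> False (VAF ~2%)
print(passes_support_check(200, 200, 2000))  # Sanger-confirmed call   -> True  (VAF ~20%)
```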


We have 30-50X WGS data. So if I understand you correctly, applying all of the VarScan filters (strand bias, default p-values for Fisher's exact test, and fpfilter with --dream3-settings) should deal with most false positives. And an additional maximum-depth-style filter (e.g. removing variants at positions with depth above the 99th percentile) would be more advisable than categorical LC filtering?
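For illustration, here is a crude sketch of that depth-percentile idea, assuming per-position depths are available (e.g. the three-column output of `samtools depth`); the 99th percentile and the file handling are example choices only.

```python
# Crude sketch of the maximum-depth filter: take per-position depths
# (e.g. the chrom<TAB>pos<TAB>depth output of `samtools depth`), compute
# the 99th percentile, and drop calls at positions above it. File formats
# and the choice of percentile are illustrative only.
from statistics import quantiles

def load_depths(depth_file):
    """Return {(chrom, pos): depth} from a samtools-depth-style file."""
    depths = {}
    for line in open(depth_file):
        chrom, pos, depth = line.split()[:3]
        depths[(chrom, int(pos))] = int(depth)
    return depths

def filter_high_depth(vcf_path, depth_file, percentile=99):
    """Yield VCF lines whose position has depth at or below the chosen percentile."""
    depths = load_depths(depth_file)
    cutoff = quantiles(depths.values(), n=100)[percentile - 1]
    for line in open(vcf_path):
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t")[:2]
        if depths.get((chrom, int(pos)), 0) <= cutoff:
            yield line
```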


Well, I use the VarScan filters plus a manual filter (a "homopolymer" calculation that checks the complexity of the bases before and after the variant position; see the sketch after this reply), but I think your case is different because you are working with low-coverage data. We fixed (per national recommendation) a minimum of 500X coverage to be "sure" about an SNV (and we still have cases in LC regions where we are in doubt), so with 30-50X you have, in my view, as much chance of missing an involved subclonal population as of calling a false-positive variant, if you see what I mean.

I can't tell you what thresholds to use for the filters, but I think that, as in my case, you can classify variants with rules derived from a certain number of sequenced samples; at least that is how I did it.
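As a rough idea of what the homopolymer check mentioned above might look like (the exact calculation used in that lab is not shown), here is a sketch with an arbitrary flank size and run-length cutoff.

```python
# Rough sketch of a homopolymer check: look at the reference bases flanking
# the variant position and report the longest single-base run. The flank
# size and the run-length cutoff are arbitrary example values, not
# validated thresholds.

def longest_homopolymer(ref_context):
    """Length of the longest single-nucleotide run in a reference context string."""
    longest, run = 1, 1
    for prev, cur in zip(ref_context, ref_context[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def in_homopolymer(ref_context, max_run=6):
    """Flag variants whose flanking reference context contains a run longer than max_run."""
    return longest_homopolymer(ref_context) > max_run

# e.g. ref_context = reference sequence +/- 10 bp around the variant position
print(in_homopolymer("ACGTTTTTTTTTACG"))  # True, run of 9 x T
print(in_homopolymer("ACGTACGTACGTACG"))  # False
```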


Sorry, I don't have any rules of thumb here. In general, for a given read length, you might want to find the regions that have an edit distance of under X from anywhere else in the genome, and avoid them. That's not really the same as low-complexity, but generally, it's the low-complexity regions and repeats that tend to be problematic in that regard.
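As a toy illustration of that idea, restricted to exact duplicates (edit distance 0): mark positions whose read-length k-mer occurs more than once in the genome, i.e. positions where a read could not map uniquely even without mismatches. A real mappability track (e.g. from GEM or GenMap) also handles near-duplicates and is far more memory-efficient; this sketch only shows the concept on a tiny example.

```python
# Toy sketch, edit distance 0 only: flag reference positions whose k-mer
# (k ~ read length) is not unique in the genome. Reverse complements and
# memory use are ignored; this is not practical for a full genome.
from collections import Counter

def non_unique_positions(sequences, k=100):
    """sequences: {chrom: sequence}. Return {(chrom, pos0)} whose k-mer is duplicated."""
    counts = Counter()
    for seq in sequences.values():
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    flagged = set()
    for chrom, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            if counts[seq[i:i + k]] > 1:
                flagged.add((chrom, i))
    return flagged

# tiny "genome" with k=4: positions 0 and 8 on chr1 share the k-mer AAAA
toy = {"chr1": "AAAATTTTAAAA", "chr2": "GGGGCCCC"}
print(sorted(non_unique_positions(toy, k=4)))
```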
