Lack of consensus between NGS & Sanger sequencing on indels/mutations
8.4 years ago
alons ▴ 270

Hi all!

We have a working NGS variant-calling pipeline for cancer, which we recently ran on a sample.

A certain mutation was found at chr3:178936095 (hg19). As part of a confirmation process, the lab ran Sanger sequencing on the mutation region and found that a 2-base deletion and a 1-base insertion (with respect to hg19) had occurred at a nearby position, chr3:178936116, and that the mutation we called appears to be a false positive.

Looking through the VCF files, I didn't find the deletion+insertion position, nor did I find it when I checked the BAM file.

We're using:

  • BWA MEM as an aligner. Our command for running is: bwa mem -t 12 -R <read groups info> <ref genome> <fastq files> > aligned.sam
  • Realigning around indels using GATK.
  • Calling variants with FreeBayes, GATK & bcftools.
  • Currently there's no filter based on read quality other than the default one in bwa mem.

There was a consensus between all 3 on the mutation we found from the NGS data.

This has led me to think that the reads 'supporting' that hypothesis were never aligned from the .fastq files, because they were too different from the reference or received a lower alignment score.

My question is: Has something similar ever occurred to you? Is there a parameter/argument I can provide BWA MEM with so that those missing reads would be aligned? Otherwise, can you recommend a program/pipeline I can use to call these indels/insertions successfully?

Thank you very much in advance,

Alon

Sanger NGS variant-calling BWA-MEM indel • 4.9k views

Was this a germline or somatic mutation?


We're not sure; the only sample was from the tumor.


You do not provide much detail; you would probably get more helpful replies with a better description of your analysis. Do you perform realignment around indels? You are aligning with bwa mem, but how are you calling the variants? Do you perform quality filtering on your NGS variants?


You're right, thanks. I've edited my question, hopefully now the info is sufficient.


I developed an analysis pipeline that has 100% sensitivity with Sanger. It's currently being used in a laboratory of the National Health Service England.


100% agreement sounds suspicious (no offence intended). Is this pipeline publicly accessible? It would be great to learn from well-tested procedures that used something other than the standard test datasets like NA12878.

EDIT: If it is not public, could you tell which variant caller it uses as a workhorse?


Yes, it is quite suspicious - when we presented the results at conferences (I presented in Manchester, England, and Dundee, Scotland), there was usually silence. Whilst it's got 100% agreement in the sense that it detects all Sanger-confirmed variants when the sample passes QC, it equally calls variants not encountered by Sanger, which could be false positives or ones that Sanger missed.

We originally just used the GATK whilst validating/researching, but had to switch variant callers before going live due to GATK licensing issues. So now it's just open-source samtools. The key (on top of good QC) is merely the 'random' selection of reads from your final aligned BAM to produce multiple 'sub-BAMs', from each of which variants are then called independently. What you'll always find is a high-quality variant called in a lower-depth sub-BAM that was seemingly missed in the original BAM. Try it out.

By producing sub-BAMs in this way, you increase your chances of calling all genuine variants. Unfortunately, when you do this, you're also calling variants from lower read depths, so QC is always critical. We never called anything below a total read depth of 18 (~97% sensitivity to Sanger at this depth), with the optimum total depth being 30 (100% sensitivity to Sanger).

To 'randomly' select reads as we did/do, one can use Picard DownsampleSam.

[Edit: I never properly researched how/why seemingly blatant variants were not being called, but I put it down to genotype prior probabilities, i.e. the prior probability of calling a variant at each position, which is influenced by read depth. When you produce sub-BAMs, you 'reshuffle' the reads and give yourself a new chance of calling.]
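The 'reshuffling' effect of sub-BAMs can be illustrated in plain Python. This is a toy sketch, not the production pipeline: the read labels, the 0.6 keep-fraction, and the seeds are all illustrative assumptions (the real selection was done with Picard DownsampleSam on paired reads):

```python
import random

def subsample_pileup(reads, fraction, seed):
    """Randomly keep ~`fraction` of the reads covering a site
    (roughly what a per-read random downsampler does)."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]

# Toy pileup: 30 reads over one position, 8 supporting the alternate allele
reads = ["alt"] * 8 + ["ref"] * 22

# Produce several 'sub-BAMs' and recompute the alt fraction in each;
# the fraction drifts from sub-BAM to sub-BAM, so a borderline variant
# can clear a caller's evidence threshold in one of them
for seed in range(5):
    sub = subsample_pileup(reads, 0.6, seed)
    if sub:
        print(f"sub-BAM {seed}: depth={len(sub)}, alt fraction={sub.count('alt') / len(sub):.2f}")
```

Because each sub-BAM sees a different random subset, the evidence at a given site is re-weighed on every draw, which is the 'new chance of calling' described above.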


I am a little confused about what you said in EDIT.

The prior probability, P(Genotype), is defined from an earlier study such as 1000 Genomes or dbSNP, and is fixed (though of course the probability values depend on the read depth behind that VCF file).

When using a sub-BAM instead of the entire BAM, the read depth / allele frequency of your current sample changes. That changes P(Data) and therefore the final P(Genotype|Data).

Is this what we are talking about?


There are 2 kinds of prior / posterior probabilities here:

  1. the 'standard' one that is fundamental to many variant callers (see 'Probabilistic methods' Here)
  2. the prior probabilities that the GATK uses via the VariantRecalibration step

I am referring to the first type in my posting (above). As mentioned, I never had time to explore this further.
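The first, 'standard' type of prior/posterior can be sketched as follows. This is a toy model, not GATK's or samtools' actual implementation: the binomial likelihood, the per-base error rate, and the prior values are illustrative assumptions chosen to show how read depth shifts the posterior:

```python
from math import comb

def genotype_posteriors(alt_reads, depth, prior_het=1e-3, err=0.01):
    """P(genotype | data) for hom-ref / het / hom-alt, using a binomial
    read-sampling likelihood and a fixed population prior."""
    # Expected alt-allele fraction of the reads under each genotype
    alt_fraction = {"hom_ref": err, "het": 0.5, "hom_alt": 1 - err}
    prior = {"hom_ref": 1 - 1.5 * prior_het, "het": prior_het, "hom_alt": prior_het / 2}
    lik = {g: comb(depth, alt_reads) * p ** alt_reads * (1 - p) ** (depth - alt_reads)
           for g, p in alt_fraction.items()}
    norm = sum(lik[g] * prior[g] for g in lik)
    return {g: lik[g] * prior[g] / norm for g in lik}

# Same 20% alt-allele fraction at two depths: at low depth the fixed
# prior still wins; with more reads the data overwhelm it
print(genotype_posteriors(2, 10))    # hom_ref favoured
print(genotype_posteriors(20, 100))  # het favoured
```

This is why a sub-BAM with a different read mix can flip a borderline call: the likelihood term changes while the prior stays fixed.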


Yes, the first type is what I was talking about. It is a fixed probability that depends on which reference you use.

I am not sure about the second type you mention. As I understand it, GATK also uses the first type as a prior; that is why the --dbsnp argument is used in HC / Mutect.

8.4 years ago
lh3 33k

By "false positive", do you really mean "false negative"? By "2-base deletion and 1-base insertion", do you mean "2-base insertion and 1-base deletion wrt hg19"?

You can cut the flanking sequences and map them to hg19 again. You will find that chr3:178936095 is similar to ~chr22:17052975. They differ by a couple of mismatches and one gap, and the gap difference happens to occur at chr3:178936116. It is possible that bwa-mem maps the reads from chr3 to chr22 (false mapping); it is also possible that your Sanger PCR is non-specific (false priming, and thus NGS is correct). Hard to conclude without detailed analysis.
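Cutting the flanking sequence around the call site can be sketched in plain Python (the file and contig names are hypothetical; in practice `samtools faidx` does the same job, and the extracted sequence can then be remapped or pasted into BLAT):

```python
def read_fasta(path):
    """Minimal FASTA reader: name -> sequence (fine for small references)."""
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name:
                seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}

def flank(seqs, chrom, pos, pad=100):
    """Return `pad` bp either side of a 1-based position."""
    seq = seqs[chrom]
    start = max(0, pos - 1 - pad)
    return seq[start:pos - 1 + pad + 1]

# e.g. query = flank(read_fasta("hg19.fa"), "chr3", 178936095)
# then remap `query` with bwa mem, or check it with BLAT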


Thanks for the detailed answer!

By false positive I mean that our variant callers called a mutation there based on the NGS data, but when the Sanger results were inspected, the peaks (according to the lab) indicated that the mutation wasn't real.

By "2-base deletion and 1-base insertion" I mean, with regard to hg19, that 2 bases at that position in hg19 were deleted and a single base (not in hg19) was inserted.

I will definitely look at chr22:17052975 and see if it was aligned there.

By "flanking sequences" do you mean the 5' and/or 3' flanking regions? How does one go about cutting them - remove them from the .fastq files?

The Sanger PCR the lab ran used primers designed from the gene the position is in, PIK3CA.


I am confused. You said "Looking through the VCF files, I didn't find the deletion+insertion position, nor did I find it when I checked the BAM file". If you did not find the indels from NGS data, how could they be false positives?


Or do you mean that Sanger and NGS find different mutations?


First of all, you deserve a huge thank you - I have news: I found an exact match in the BAM file between a sequence in the area you mentioned and the mutant sequence deduced from the lab's analysis of the Sanger sequencing!

Nine identical 45-bp forward and reverse sequences were found perfectly aligned in the BAM file near the area you mentioned. I pasted them into the UCSC BLAT tool, and it reported a 100% match to the chr22 region and only a 97.8% match to chr3. This means bwa mem did a perfectly good job aligning, since the reads really do match that chr22 area at 100%. The mutation apparently made the sequences identical to that region, so they weren't aligned to where they originated in the DNA.

I'm actually super impressed, BTW - did you use BLAT to tell that the areas were similar?

Now, what can one do in the future when such mutations cause reads to align to completely different areas? The lab said that the mutation we called (maybe falsely) at chr3:178936095 might have been caused by the real indel at chr3:178936116.

P.S. I meant that I couldn't find the indel the lab called based on the Sanger sequencing. Also, Sanger found the indel but indicated that the mutation called at chr3:178936095 wasn't real.

8.4 years ago

If you are interested in detecting and calling indels, I recommend mapping with BBMap with default parameters, as it tends to call indels much more accurately than any other aligner. But it's not just the aligner that's important here - the variant-caller will play a role too. I'm not sure which is currently best, but there are several alternatives - mpileup, GATK, and freebayes (although unfortunately freebayes cannot process the current SAM specification).

You may want to look at your bam file in IGV to see if the problem is that the reads were unmapped or incorrectly mapped (and thus there are no reads indicating indels), or whether the problem is the variant caller (in which case IGV would show reads indicating the indels).

Edit - specifically, if you want to use freebayes with BBMap, use BBMap's sam=1.3 flag to generate old-style cigar strings with M instead of = and X.


I have used freebayes on SAM files generated by BBMap; you just need to convert to BAM, sort, and index, and maybe also remove duplicates and add read groups.


GATK-HC completely ignores gap placement, so it doesn't matter whether the gap placement is good or not. FreeBayes realigns reads: as long as the gaps are present in a couple of reads, they can fix the alignment of other reads. Realignment from multiple reads is much more powerful than aligning each read independently. For modern variant callers especially, accuracy in placing gaps is usually unimportant. What matters most for a mapper is whether it can map the whole read to roughly the correct position with an accurate mapping quality.

FreeBayes is the best general-purpose open-source variant caller. I would suggest making BBMap work with it, regardless of whether FreeBayes supports all SAM/BAM features.


I completely agree about FreeBayes - it's now our main variant caller, as it called variants that were mostly confirmed by Sanger sequencing. It's also good for cancer samples, since it has low-allele-fraction / high-ploidy parameters you can set to make it ultra-sensitive.


Thanks for the detailed answer. I will try using BBMap and see how it works out. We're currently using the variant callers you've mentioned, so it shouldn't be a problem to start from BBMap and then continue 'as usual' with those callers; I will try to adjust the output so that FreeBayes can read it properly.

I did look at the BAM file in IGV and samtools tview, and it did seem that those reads were unmapped or incorrectly mapped, as none of the reads at the position supported the 2-base deletion and 1-base insertion the Sanger analysis indicated - leading me to think the problem wasn't in the variant caller.

8.4 years ago
apelin20 ▴ 480

There is an old trick in the book that almost never fails. If the mutation you saw in your chromatogram is significant (somewhat abundant) and real (not an artifact), then there is no reason for it not to be in your reads. You can grep your reads to find out how many support that indel: simply take 13 bp to the left and 13 bp to the right of your variant and grep that sequence. For instance, say you have the following scenario:

Ref    : ATGCTAGCTGATCGATCGTGACTGCTG
Mutant : ATGCTAGCTGAT--ATCGTGACTGCTG

I would grep the mutant genotype ATGCTAGCTGATATCGTGACTGCTG from my fastq file.
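The grep trick translates directly into a few lines of Python that also handle the reverse complement in one pass (the fastq file name is hypothetical; running `grep -c` on both orientations is equivalent):

```python
def revcomp(seq):
    """Reverse complement of an ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_support(fastq_path, query):
    """Count fastq reads containing the query or its reverse complement,
    like `grep -c` run on both orientations."""
    rc = revcomp(query)
    hits = 0
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            # in a 4-line fastq record, line 1 (0-based) holds the sequence
            if i % 4 == 1 and (query in line or rc in line):
                hits += 1
    return hits

# Mutant genotype from the example above (the "CG" deletion):
query = "ATGCTAGCTGATATCGTGACTGCTG"
# count_support("sample.fastq", query)
```

Checking only the sequence lines (rather than grepping the whole file) avoids accidental hits in read names or quality strings.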


Thanks, I actually thought about this but wasn't sure it could be done efficiently on large .fastq files - now that you've said it's doable, I will definitely try it out!


The bigger the coverage, the more reliable the method is. Don't forget to also take the reverse complement and grep it: grep -c "GENOTYPE" file.fastq
