Can phasing or pre-phasing during base calling cause indels?
6.4 years ago
CY ▴ 750

I recently read about phasing and pre-phasing during base calling. It is a concept that I had not paid attention to before.

I suspect that, although phasing correction is applied during base calling, this kind of bias could still be a problem for indel detection, right?

Is there a specific computational step designed to take care of this potential bias when calling indels? Any input is appreciated.

indel phasing • 2.8k views
6.4 years ago

I think that your terminology may throw a lot of people off because the word phasing is usually reserved for haplotype phasing, i.e., the determination of the allele (maternal or paternal) on which a particular variant appears. Instead, I am certain that you refer to the type of phasing that occurs during sequencing-by-synthesis (SBS). Thankfully, I have a colleague who worked on the development of the Solexa technology (which was purchased by Illumina and eventually used in MiSeq and HiSeq), and s/he explained this to me.

During each cycle of sequencing in SBS, a single base is supposed to be added by polymerase to the growing double strand. Every now and then, more than 1 base will be added during the same cycle - this is Pre-Phasing. When no base is added (when it's supposed to be), it's called Phasing.
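To make the definitions above concrete, here is a minimal sketch of how a cluster drifts out of sync. It assumes each strand independently adds 0 bases (phasing), 1 base (correct), or 2 bases (pre-phasing) per cycle with fixed probabilities. The function name `offset_distribution` and the rates used are illustrative assumptions, not instrument values.

```python
def offset_distribution(n_cycles, p_phasing, p_prephasing):
    """Return {offset: fraction of strands} after n_cycles.

    offset = bases incorporated minus cycles run:
    0 is in phase, negative is lagging (phasing),
    positive is running ahead (pre-phasing).
    """
    p_ok = 1.0 - p_phasing - p_prephasing
    dist = {0: 1.0}  # every strand starts in phase
    for _ in range(n_cycles):
        nxt = {}
        for offset, frac in dist.items():
            # Adding 0, 1 or 2 bases shifts the offset by -1, 0 or +1.
            for step, prob in ((-1, p_phasing), (0, p_ok), (1, p_prephasing)):
                key = offset + step
                nxt[key] = nxt.get(key, 0.0) + frac * prob
        dist = nxt
    return dist


# Illustrative (assumed) rates: 0.2% phasing, 0.1% pre-phasing per cycle.
for cycles in (25, 100, 250):
    dist = offset_distribution(cycles, 0.002, 0.001)
    print(cycles, "in-phase fraction:", round(dist[0], 3))
```

Under these assumptions the in-phase fraction shrinks with every cycle, which is why the effect matters more towards the end of a long read.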

Although this can contribute to sequencing errors at every cycle, I don't believe that there is any data to say that it actually results in more indel calls. These types of errors will be isolated to single reads.

To answer your question more specifically, a level of correction does occur within the instrument whilst all of this is happening, and it's a mixture of both wet-lab and informatic 'tricks'. Essentially, one can spike in a PhiX control sample in order to assist in the detection of these errors and 're-calibrate' the reads as they are being sequenced. The instrument is capable of detecting these issues in the first place based on signal 'cross-talk', i.e., signal strength and the presence of two signals at the same cycle (when pre-phasing has occurred), or no signal (phasing). In the initial cycles, the sequencer may struggle to detect these errors. However, because each cycle depends on the fidelity of the previous one, the errors grow as the run continues until they are easily detectable, and that's when the error correction kicks in.

For further reading, see Illumina's own technical white-paper on this: Using a PhiX Control for HiSeq® Sequencing Runs

Kevin

Kevin

Thank you for the detailed explanation. You mentioned "These types of errors will be isolated to single reads, which is ever more reason why one should never call variants from single reads". However, I gave this question a second thought: this phasing / pre-phasing bias during base calling should only occur on a single or a few reads in the cluster. Most of the reads within a cluster should generate a much more intense signal that easily overwhelms the phasing / pre-phasing signal, right? If this is the case, the correction seems to be redundant. I feel I am missing something here...

"Most of the reads within a cluster should generate a much more intense signal that easily overwhelms the phasing / pre-phasing signal, right?"

Even if they don't, their presence should result in low base quality scores for the remainder of the read.

PCR stutter is a bigger indel issue since it occurs prior to cluster generation, so all reads in the cluster contain the stutter.

How big is the problem, do you think, with the PCR amplification stage? Everyone generally accepts that PCR faithfully amplifies DNA, but I have data that casts doubt on this.

PCR-generated indels are a known problem (at least for most bench scientists) - it's why Illumina developed a PCR-free library prep. They're most common in homopolymer runs and other short tandem repeats (mono/di/trinucleotide repeats).

As @d-cameron noted, indels introduced during library prep will be homogeneous in the cluster (which is initiated from a single library molecule). Cluster generation is also a PCR reaction, so indels introduced at that point will be heterogeneous.

@CY, phasing correction is not designed to handle either of those problems. Instead, it corrects indels generated by DNA polymerase during the sequencing reaction, when some of the molecules within the cluster become +1 or -1 relative to the actual cycle. My understanding is that it compares the base (actually, the signal intensity) of the previous and subsequent cycles to the current cycle and adjusts the signal accordingly. Because it resolves signal heterogeneity, it would not correct library prep errors, and is unlikely to correct cluster-generated errors (except perhaps for mononucleotide indels).
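A rough sketch of the "compare the previous and subsequent cycles and adjust" idea described above. This is not Illumina's actual algorithm: it assumes a fixed fraction of the cluster signal leaks in from the previous cycle (`lag`, phasing) and from the next cycle (`lead`, pre-phasing), and applies a first-order correction only. All function names and rates here are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Ideal four-channel signal: full intensity on the true base only."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def neighbours(signal, c):
    """Previous and next cycle, reusing the current cycle at the read ends."""
    prev = signal[c - 1] if c > 0 else signal[c]
    nxt = signal[c + 1] if c < len(signal) - 1 else signal[c]
    return prev, nxt


def dephase(signal, lag, lead):
    """Mimic a dephased cluster: lagging strands show the previous cycle's
    base, leading strands show the next cycle's base."""
    out = []
    for c, cur in enumerate(signal):
        prev, nxt = neighbours(signal, c)
        out.append([(1 - lag - lead) * cur[k] + lag * prev[k] + lead * nxt[k]
                    for k in range(4)])
    return out


def correct(observed, lag, lead):
    """First-order correction: subtract the estimated neighbour leakage
    and rescale. Residual leakage of second order remains."""
    out = []
    for c, cur in enumerate(observed):
        prev, nxt = neighbours(observed, c)
        out.append([(cur[k] - lag * prev[k] - lead * nxt[k]) / (1 - lag - lead)
                    for k in range(4)])
    return out


def call_bases(signal):
    """Call the brightest channel at each cycle."""
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in signal)


def squared_error(a, b):
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


seq = "ACGTACGT"
truth = one_hot(seq)
observed = dephase(truth, lag=0.2, lead=0.1)   # assumed, exaggerated rates
corrected = correct(observed, lag=0.2, lead=0.1)
print("called after correction:", call_bases(corrected))
print("error before:", squared_error(observed, truth))
print("error after: ", squared_error(corrected, truth))
```

Note that the correction only redistributes signal between neighbouring cycles within one cluster. A change that is already present in every molecule of the cluster produces a clean signal, so nothing here would touch it, which matches the point about library prep errors.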

PCR stutter is a big enough issue for micro-satellites that many STR callers explicitly model stutter as part of their calling. These sites are particularly prone due to their repetitive nature.
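As a toy illustration of what "modelling stutter" can mean, the sketch below scores candidate repeat counts for a homozygous site. It assumes a read reports the true repeat count with probability `1 - stutter`, is one unit shorter or longer otherwise (contractions being more likely), and treats larger shifts as very unlikely. The parameter values and function names are made up for illustration and are not taken from any particular STR caller.

```python
import math


def read_likelihood(observed, true_len, stutter=0.1, down_bias=0.7):
    """P(observed repeat count | true repeat count) under the toy model."""
    diff = observed - true_len
    if diff == 0:
        return 1.0 - stutter
    if diff == -1:
        return stutter * down_bias          # contraction by one unit
    if diff == 1:
        return stutter * (1.0 - down_bias)  # expansion by one unit
    return 1e-6                             # larger shifts: assumed very rare


def best_allele(observed_counts, candidates):
    """Pick the candidate repeat count with the highest log-likelihood."""
    def log_lik(allele):
        return sum(math.log(read_likelihood(o, allele)) for o in observed_counts)
    return max(candidates, key=log_lik)


# Repeat counts seen in seven reads covering one microsatellite (toy data).
reads = [10, 10, 9, 10, 9, 10, 11]
print(best_allele(reads, range(7, 14)))
```

Without such a model, the two reads showing 9 repeats could be mistaken for evidence of a one-unit deletion.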

To CY:

Yes, that's correct, but the error propagates and increases because the product of one cycle becomes the substrate / template for the next cycle. Thus, the erroneous 'cross-talk' signal grows larger cycle after cycle as the strands become more out of sync.

This is why the error checking mechanism kicks in at a later cycle after the sequencing has started, i.e., cycle #25 according to the Illumina white paper:

Real Time Analysis (RTA) software aligns complete sequences to the PhiX reference beginning after the 25th cycle is accumulated and calculates error rates, providing an indication of sequencing success during the run

@Kevin Blighe, the error rate is distinct from phasing issues. It's determined by aligning the PhiX spike-in reads to the reference and calculating the percentage of differences between the two. The majority of those errors are single nucleotide mismatches, which can be caused by PCR misincorporation errors (no polymerase is perfect), chemical damage to the DNA, a bubble in the flow cell, etc. It's calculated after cycle 25 so that the reads are long enough to align accurately.
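For illustration, the percentage calculation could look something like the sketch below. It assumes the PhiX reads have already been aligned without gaps, so each read is paired with the reference segment it maps to. The function name `per_cycle_error_rate` and the toy reads are invented, and this is not the RTA implementation.

```python
def per_cycle_error_rate(aligned_pairs):
    """Percent mismatches at each cycle.

    aligned_pairs: list of (read, reference_segment) strings, ungapped and
    already aligned, so position i in both strings is cycle i + 1.
    """
    n_cycles = max(len(read) for read, _ in aligned_pairs)
    mismatches = [0] * n_cycles
    totals = [0] * n_cycles
    for read, ref in aligned_pairs:
        for cycle, (r, f) in enumerate(zip(read, ref)):
            totals[cycle] += 1
            if r != f:
                mismatches[cycle] += 1
    return [100.0 * m / t if t else 0.0 for m, t in zip(mismatches, totals)]


# Toy data: two reads against the same reference segment.
pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT")]
print(per_cycle_error_rate(pairs))
```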

Makes sense - thanks for sharing, Harold

Also, after reading some material, I think "cross-talk" is unrelated to phasing/pre-phasing. Phasing/pre-phasing occurs when a strand does not add exactly one base in a cycle, while cross-talk occurs when multiple clusters sit too close together and their signals overlap. Am I right?

There are possibly different types of cross-talk. Cross-talk is a general term and is even mentioned in relation to genotyping arrays. It is definitely mentioned in relation to phasing/pre-phasing, and I don't doubt that it's also used in relation to clusters.

Got it. Thanks for explaining, Kevin.
