I would like to announce MAGERI, our recently published software tool designed to
facilitate the detection of ultra-rare variants in various kinds of
high-throughput sequencing datasets prepared using molecular barcoding.
The ability to detect ultra-rare variants present at ~0.1% frequency in a sample is
one of the key requirements for successful circulating tumor DNA screening and for studying
rare tumor subpopulations and rare drug-resistant variants in viral populations.
However, the sequencing error rate is far above the level required for
accurate rare variant calling, even for sequencing datasets of top-tier quality.
Recent development of molecular barcoding technology makes it possible to eliminate sequencing
errors by tagging each input molecule with a unique molecular identifier (UMI) [Marx. Nature Methods 2017]. UMI-tagged
read groups can then be assembled into consensus sequences, correcting sequencing errors. Still,
residual PCR errors introduced in the first PCR cycles and during UMI tag attachment can
decrease the accuracy of variant calling. Moreover, to the best of my knowledge, there
is so far no dedicated variant caller that can model error rates in UMI-tagged read group consensus
sequences. MAGERI aims to solve this problem by implementing a consensus
assembly, alignment and variant calling pipeline optimized for UMI-tagged data
[Shugay et al. PLoS Comput Biol 2017].
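
To make the consensus assembly idea concrete, here is a minimal Python sketch of grouping reads by UMI and taking a per-position majority vote. It is an illustration only and not MAGERI's actual algorithm: the function names, the `min_reads` threshold and the plain majority-vote rule are all assumptions made for this example, and it ignores base qualities, indels and UMI sequencing errors.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs into read groups keyed by UMI.

    Hypothetical helper: assumes UMIs were already extracted from the reads.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    return dict(groups)


def consensus(seqs, min_reads=3):
    """Per-position majority vote over a read group.

    Returns None when the group has fewer than min_reads reads
    (an assumed cutoff), since a vote over so few reads cannot
    correct errors reliably.
    """
    if len(seqs) < min_reads:
        return None
    # Only compare positions covered by every read in the group.
    length = min(len(s) for s in seqs)
    return "".join(
        Counter(s[i] for s in seqs).most_common(1)[0][0]
        for i in range(length)
    )


# Toy input: three reads share one UMI, one of them carries an error.
reads = [
    ("AAT", "ACGT"),
    ("AAT", "ACGT"),
    ("AAT", "ACTT"),
    ("GGC", "ACGA"),
]
groups = group_by_umi(reads)
consensus_by_umi = {umi: consensus(seqs) for umi, seqs in groups.items()}
```

In this toy input the single mismatching read in the `AAT` group is outvoted, while the `GGC` group is dropped for having too few reads. Errors that arise in the first PCR cycles are shared by most reads in a group and therefore survive such a vote, which is why a downstream error model is still needed.
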
Note that datasets containing rare variants at known frequencies, together with a control
dataset from healthy donor plasma DNA, are publicly available in the SRA; see this repository for metadata
and analysis scripts/templates.
We hope that these benchmark datasets will be of use to the community, especially to
researchers developing software tools for UMI-tagged data processing and rare variant