How are probabilities of insertions and deletions encoded in FASTQ?
13 months ago
shelkmike • 130
Russian Federation

This is a basic question; however, I couldn't find an answer anywhere. Traditionally, the quality score of a base in FASTQ indicates the probability that this base is wrong. This is reasonable for Illumina, where the typical sequencing error is a single base substitution (for example, "A" occurs in a sequencing read where "G" in fact should be). However, for some sequencing machines, like the sequencing machines of Oxford Nanopore Technologies (ONT), deletions and insertions are also frequent. In a FASTQ file, each base of a read has exactly one symbol denoting its quality.

For example, the problem arises in this case: there is a read with the sequence ATTGCTAC. Suppose, for simplicity, that every base is correct with 100% probability, but an insertion of TAT between G and C is quite likely. How can the probability of this insertion be encoded in FASTQ if each quality symbol corresponds strictly to one base of the read?

My main questions are: 1) Do FASTQ files with ONT reads incorporate probabilities of insertions and deletions, or do they take into account only probabilities of single-base substitutions? 2) If probabilities of indels are encoded in FASTQ, how exactly is it done?

I will be grateful for any help.


I think such information comes from the mapper/aligner (in reference-based assemblies). Read about the SAM format (the most widely used alignment format), especially CIGAR strings.
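To illustrate the pointer to CIGAR strings, here is a minimal Python sketch that counts inserted and deleted bases in one alignment. The CIGAR string in the example is made up, and `count_indels` is a hypothetical helper, not part of any library. Note that a CIGAR string records indels observed after alignment to a reference; it does not store a probability for them.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_indels(cigar):
    """Return (inserted_bases, deleted_bases) relative to the reference."""
    inserted = 0
    deleted = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "I":    # bases present in the read but not in the reference
            inserted += int(length)
        elif op == "D":  # bases present in the reference but not in the read
            deleted += int(length)
    return inserted, deleted


# Hypothetical CIGAR string, invented for illustration only.
print(count_indels("4M3I4M2D10M"))
```

In practice the CIGAR string would be taken from column 6 of a SAM record produced by the aligner.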

9 months ago
Freiburg, Germany
  1. No, FASTQ files contain only per-base call quality scores. There's no information about the likelihood of an InDel.
  2. N/A
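As a rough illustration of what a per-base quality score carries, the sketch below decodes a quality string into error probabilities using the standard Phred relation P = 10^(-Q/10). It assumes the common Phred+33 ASCII offset; `phred_to_error_prob` is a hypothetical helper name and the quality string is made up.

```python
def phred_to_error_prob(quality_string, offset=33):
    """Convert each quality character to the probability that its base call is wrong.

    Assumes Phred+33 encoding by default (an assumption, not read from the file).
    """
    probabilities = []
    for ch in quality_string:
        q = ord(ch) - offset           # Phred quality score Q
        probabilities.append(10 ** (-q / 10))
    return probabilities


# Made-up quality string: 'I' is Q40, '5' is Q20, '+' is Q10 under Phred+33.
for ch, p in zip("I5+", phred_to_error_prob("I5+")):
    print(ch, p)
```

Each character yields exactly one number tied to one called base, so there is no slot left in which to store the chance of an insertion or deletion between two bases.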

InDels tend to be randomly distributed in nanopore data, with the exception of an enrichment in homopolymer stretches.


Thank you. Also, can you give a link to a source where I can read about this?


Any review of ONT data should discuss InDel distributions; I've even seen ONT talk about it in their presentations. For Phred scores, there won't be anything that mentions that.


Thank you once again. Sorry for the doubts, but if nothing mentions it, how do you know that probabilities of indels are not reflected in FASTQ in some way?


Because FASTQ files aren't structured in a way that would permit that.
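The structural constraint can be sketched in Python: a FASTQ record is four lines (header, sequence, separator, qualities), and the quality line must be exactly as long as the sequence line. The parser below is a simplified, hypothetical helper that assumes single-line sequences; the read name `read1` and the quality string are made up, while the sequence is the one from the question.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (name, [(base, quality_char), ...])."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        # One quality character per base: no room for anything "between" bases.
        raise ValueError("sequence and quality lengths differ")
    return header[1:], list(zip(seq, qual))


# Hypothetical record using the read sequence from the question.
record = ["@read1", "ATTGCTAC", "+", "IIIIIIII"]
name, pairs = parse_fastq_record(record)
print(name, pairs)
```

A possible TAT insertion between G and C has no base in the read, so it has no quality character either.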


FASTQ quality scores encode the probability of this specific nucleotide being in that specific position of the read. They don't know anything about variants (neither SNPs nor indels), because you only get those by comparing the read to the reference genome.


By indels I mean not genomic variants, but sequencing errors which result in insertion or deletion of a sequence in a sequencing read compared to the genome.


Well, at the location of a false-deletion-sequencing-error the pore essentially skipped a few nucleotides - didn't read them carefully enough. This might result in a lower quality for the nucleotides surrounding this fake-deletion, but this doesn't inform us about what the cause might be - the indel. So no, the probability of an indel is not explicitly encoded in the fastq.


It is possible for Q scores to reflect homopolymer errors, but not indels. I don't know enough about ONT to say how the signal is processed to produce Q scores. The example the OP provided would not fit the category of a homopolymer.


The sequences in FASTQ files each represent one molecule. The terms insertion and deletion only make sense in comparison to something.

The quality scores given in FASTQ files are the result of comparing the signal to the noise. At each measurement point there must be a signal. It is not possible that you sometimes measure nothing ("because there is a deletion") and then get a signal again at a later point. The DNA molecule is continuous and has no gaps.
