Question

what is the purpose of indexing the reference genome (Kallisto)

0

Entering edit mode

10 days ago

Aaliya ▴ 10

Hello everyone! I have performed Kallisto; I understand the principle of pseudo alignment it works on. However, I cannot wrap my head around two things:

What exactly happens in indexing? How does the indexed file look like? (I am unable to open my indexed file due to large size). I am using Homo_sapiens.GRCh38.cdna.all for indexing.
During quantification, is the RNA seq data (.fastq.gz) (which is in the form of transcripts, I am guessing) mapped to the indexed file?

The output I get when I run indexing is:

**[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31

[index] number of targets: 205,792

[index] number of k-mers: 116,049,664

[index] number of equivalence classes: 841,430

[quant] running in paired-end mode

[quant] will process pair 1: DRR438599_1.fastq.gz

                             DRR438599_2.fastq.gz

[quant] finding pseudoalignments for the reads ... done

[quant] processed 15,709,385 reads, 9,655,075 reads pseudoaligned

[quant] estimated average fragment length: 146.406

[   em] quantifying the abundances ... done

[   em] the Expectation-Maximization algorithm ran for 1,249 rounds**

Why is K-mer always 31 for the Homo sapiens reference genome?

indexing Kallisto • 346 views

ADD COMMENT • link updated 10 days ago by dsull ★ 6.0k • written 10 days ago by Aaliya ▴ 10

score 2 · Answer 1 · 2024-04-26

I'll see if I can explain things in super simple terms (albeit with some abuse of the actual technical details) to provide a conceptual understanding:

kallisto (and other pseudomappers) index the reference TRANSCRIPTOME, not genome. You take your 205,792 transcript sequences and those are the targets you map against (e.g. does my sequencing read come from from transcript #50, transcript #2858, or transcript #85053?). This is different than genome alignment where you can say "oh, my 100-bp RNA-seq read spans positions 507695-507795 of the X chromosome").

The kallisto index is, in the simplest sense, a hash table of k-mers (i.e. sequences of length k). The index file contains a binary representation of this hash table (which is why you can't open it up in a text editor).

And yes, during quantification, your RNA-seq reads are mapped to this index.

Perhaps your RNA-seq read contains a 31-bp sequence like ACGATTTTAAACCGAGAGACTAGAGAGCCCC (for k =31, which is the default setting). You look up at the 31-bp sequence in your hash table to determine which transcript(s) that 31-bp sequence may have originated from.

k=31 is the default setting (regardless of Homo sapiens, mouse, dog, or something else). You can set k to another number than 31 if you want, but 31 works well. Imagine if you have k=3 instead. There are fewer than 100 possible k-mers you can get from k=3 (heck, a ton of the k-mers that you look up will map to tens of thousands of possible transcripts). So, not ideal! Obviously, if you're looking for RNA molecules that are only 15-bp long, you'd need a smaller k (you can't create a 31-bp k-mer from a sequence that's only 15-bp long), but otherwise, just stick to 31 :)

Why can't we go up to, say, k=55? That's because each nucleotide, in binary representation, requires two bits to encode it (4 nucleotides: 00, 01, 10, 11). If you have k=55, that means 110 bits (but my Macbook is a 64-bit machine -- I don't have 110 bits! ). OK, yes, you could do some workarounds and engineer software to store >64-bit data structures, but I don't see the need to in the case of kallisto...

(Note: there are other instances why increasing k might be suboptimal sometimes in terms of mapping accuracy, but I'll withhold that discussion for another time or unless someone asks).

Edit: Why do we need this hash table index? Well, if you want to scan all 31-bp sequences in your RNA-seq reads and grep for them in your Homo_sapiens.GRCh38.cdna.all FASTA file, be my guest ;) Might take you a bajillion years.