what is the purpose of indexing the reference genome (Kallisto)
1
0
Entering edit mode
10 days ago
Aaliya ▴ 10

Hello everyone! I have performed Kallisto; I understand the principle of pseudo alignment it works on. However, I cannot wrap my head around two things:

  1. What exactly happens in indexing? How does the indexed file look like? (I am unable to open my indexed file due to large size). I am using Homo_sapiens.GRCh38.cdna.all for indexing.

  2. During quantification, is the RNA seq data (.fastq.gz) (which is in the form of transcripts, I am guessing) mapped to the indexed file?

The output I get when I run indexing is:

**[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31

[index] number of targets: 205,792

[index] number of k-mers: 116,049,664

[index] number of equivalence classes: 841,430

[quant] running in paired-end mode

[quant] will process pair 1: DRR438599_1.fastq.gz

                             DRR438599_2.fastq.gz

[quant] finding pseudoalignments for the reads ... done

[quant] processed 15,709,385 reads, 9,655,075 reads pseudoaligned

[quant] estimated average fragment length: 146.406

[   em] quantifying the abundances ... done

[   em] the Expectation-Maximization algorithm ran for 1,249 rounds**

Why is K-mer always 31 for the Homo sapiens reference genome?

indexing Kallisto • 346 views
ADD COMMENT
2
Entering edit mode
10 days ago
dsull ★ 6.0k

I'll see if I can explain things in super simple terms (albeit with some abuse of the actual technical details) to provide a conceptual understanding:

kallisto (and other pseudomappers) index the reference TRANSCRIPTOME, not genome. You take your 205,792 transcript sequences and those are the targets you map against (e.g. does my sequencing read come from from transcript #50, transcript #2858, or transcript #85053?). This is different than genome alignment where you can say "oh, my 100-bp RNA-seq read spans positions 507695-507795 of the X chromosome").

The kallisto index is, in the simplest sense, a hash table of k-mers (i.e. sequences of length k). The index file contains a binary representation of this hash table (which is why you can't open it up in a text editor).

And yes, during quantification, your RNA-seq reads are mapped to this index.

Perhaps your RNA-seq read contains a 31-bp sequence like ACGATTTTAAACCGAGAGACTAGAGAGCCCC (for k =31, which is the default setting). You look up at the 31-bp sequence in your hash table to determine which transcript(s) that 31-bp sequence may have originated from.

k=31 is the default setting (regardless of Homo sapiens, mouse, dog, or something else). You can set k to another number than 31 if you want, but 31 works well. Imagine if you have k=3 instead. There are fewer than 100 possible k-mers you can get from k=3 (heck, a ton of the k-mers that you look up will map to tens of thousands of possible transcripts). So, not ideal! Obviously, if you're looking for RNA molecules that are only 15-bp long, you'd need a smaller k (you can't create a 31-bp k-mer from a sequence that's only 15-bp long), but otherwise, just stick to 31 :)

Why can't we go up to, say, k=55? That's because each nucleotide, in binary representation, requires two bits to encode it (4 nucleotides: 00, 01, 10, 11). If you have k=55, that means 110 bits (but my Macbook is a 64-bit machine -- I don't have 110 bits! ). OK, yes, you could do some workarounds and engineer software to store >64-bit data structures, but I don't see the need to in the case of kallisto...

(Note: there are other instances why increasing k might be suboptimal sometimes in terms of mapping accuracy, but I'll withhold that discussion for another time or unless someone asks).

Edit: Why do we need this hash table index? Well, if you want to scan all 31-bp sequences in your RNA-seq reads and grep for them in your Homo_sapiens.GRCh38.cdna.all FASTA file, be my guest ;) Might take you a bajillion years.

ADD COMMENT
1
Entering edit mode

Specifically:

If we want to look up all the places where a 31mer is in a genome without and index, we have to compare that 31 to the first 31bp of the genome, then bases 2-32, then bases 3-33. Thus i we have to do n-31 comparisons, where n is the length of the genome. We say that doing this is O(n).

A hash table is a table where in one column you have the 31mer, and in the other you have a list of all the transcripts it could have come from.

With a standard table like this, you would have to say "is my 31mer the first row? No, then is it the second row?" etc. This is actaully worse for a 31 mer because the table would have up to 4^31 rows.

However, the clever things about a hash table is that you have a "hash function" which takes the sequence of the 31mer and converts it to a number. That number is a memory location, where the table row for that 31mer (containing all the transcripts that contain that 31 mer) is stored.

Thus finding the correct row takes a fixed amount of time that doesn't change with the length of the genome or the length of the kmer. We say that it is O(1).

ADD REPLY

Login before adding your answer.

Traffic: 2614 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6