Question

Predicted gene differentially expressed in Kallisto, not in FeatureCounts

0

Entering edit mode

6.1 years ago

Bastien Hervé 5.3k

I'm currently trying to output differentially expressed (DE) genes from mm10 RNA-seq data. I did in parallel a genomic approche (Star -> featureCounts -> DESeq2 -> Heatmap) and a transcriptomic approche (Kallisto -> DESeq2 -> Heatmap) to intersect my results.

My results tend to overlap, except a gene which is highly expressed in Kallisto but not in featureCounts.

This gene is a predicted gene, Gm4737. The counts for this gene, after librairies size normalization are :

dds <- DESeq(dds)
dds_norm <- counts(dds, normalized=TRUE)

Approche/Condition    A1    B1    C1    A2    B2    C2    A3    B3    C3
     Kallisto         0     1019  0     0     1416  1226  0     1209  34
   FeatureCounts      306   581   295   230   531   501   138   457   249

My reference genome :

url='ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M16';
axel -q $url/GRCm38.p5.genome.fa.gz;

My reference transcriptome :

url='ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M16';
axel -q $url/gencode.vM16.transcripts.fa.gz;

My annotation file :

url='ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_mouse/release_M16';
axel -q $url/gencode.vM16.chr_patch_hapl_scaff.annotation.gtf.gz;

I used the same annotation file in both approche.

Any ideas about the huge counts difference between the two approches ? Does kallisto has problems with predicted genes ?

Thanks !

Kallisto FeatureCounts • 3.3k views

ADD COMMENT • link updated 6.1 years ago by h.mon 35k • written 6.1 years ago by Bastien Hervé 5.3k

2

Entering edit mode

It is interesting that you have a specific example but it may not be surprising due to fundamental difference between methodologies, alignment (STAR) and mapping (kallisto) (See A: Alignment and mapping ).

ADD REPLY • link 6.1 years ago by GenoMax 141k

0

Entering edit mode

You seems to be right, according to my IGV visualization. I checked at my pseudoalignments from kallisto 0.44.0, on Gm4737 positions, for A1 condition, there is 0 read mapped, whereas, for aligned reads (Star), on Gm4737 positions, for A1 condition, I got some reads aligned. And, for pseudoalignments, on Gm4737 positions, for B1 condition, there are a lot of reads mapped, which correlates with my counts table above.

Even if the two approches are different, I am still stuned by the colossal difference of counts that are generated on some genes while others are quite similar...

Thanks for the link !

ADD REPLY • link 6.1 years ago by Bastien Hervé 5.3k

1

Entering edit mode

Forgot about this post: Mapping to a transcriptome can incorrectly report reads as mapping uniquely.

ADD REPLY • link 6.1 years ago by h.mon 35k

score 1 · Answer 1 · 2018-03-16

1

Entering edit mode

6.1 years ago

h.mon 35k

I suppose this is a mouse predicted gene, right? If you blast Gm4737 (chr6), there is an almost identical gene, Ahcyl1 (chr3), and an almost identical pseudogene, Gm9826 (chr1). Blast says they are 99% identical over 99% Gm4737 length. This probably leads to several reads being multi-mapped, and featureCounts discards them (unless you used -M), while kallisto uses expectation-maximization to optimally apportion reads to genes.

My initial intuition was featureCounts counts would always be lower than kallisto counts, as featureCounts throws away reads that kallisto doesn't, but this is not exactly what happens. You should check these other genes, and see if the reads mapping on one location are also mapping on the other.

ADD COMMENT • link 6.1 years ago by h.mon 35k

0

Entering edit mode

Nice hint ! So yeah this is a mouse predicted gene and I don't allow multimapping. I used featureCounts in R so I don't have the -M option, but I set this one :

allowMultiOverlap=FALSE

"while kallisto uses expectation-maximization to optimally apportion reads to genes."

So you mean that kallisto can't multi-mapped, it will select the "best" gene for each read ? For featureCounts, with my option, every read multimapped will be discard, like you said Kallisto should have higher counts.

I got the results for identical gene and pseudogene (Is there a way to make a simple table in Biostars ?) :

Approche/Condition    A1      B1      C1      A2      B2      C2      A3      B3      C3
       **Gm4737**
        Kallisto      0       1019    0       0       1416    1226    0       1209    34
   FeatureCounts      306     581     295     230     531     501     138     457     249
       **Gm9826**
        Kallisto      0       0       1       0       1       3       0       0       0
   FeatureCounts      54      70      68      54      81      88      42      56      33
       **Ahcyl1**
        Kallisto      3608    3379    3543    3390    3712    3801    3422    3749    2782
   FeatureCounts      3767    3441    3729    3488    3534    3793    3195    3216    2876

My opinion is that Gm4737 is overexpressed because Ahcyl1 and Gm4737 are too similar to be specificaly identified. In both approche some reads are affected to Gm4737 (and Gm9826).

What's bothering me is, for kallisto approche, in condition A1, I got full reads on Ahcyl1 and for condition B1, I got 1/4 on Gm4737 and 3/4 on Ahcyl1. What's happended here ? Why suddenly for condition B kallisto selected Gm4737 over Ahcyl1 ?

ADD REPLY • link 6.1 years ago by Bastien Hervé 5.3k

1

Entering edit mode

So you mean that kallisto can't multi-mapped, it will select the "best" gene for each read ?

No, it will selected the best distribution of ambiguously mapped reads based on the distribution of unambiguously mapped reads - that is, it may add fractions of counts to several transcripts / genes. That is the reason you have counts like "ENSMUST00000059524.6 : 1105.28" for kallisto, but not for featureCounts.

ADD REPLY • link 6.1 years ago by h.mon 35k

0

Entering edit mode

Oh, by exemple let say that a read can map on transcript1 and transcript2.

With 100 unambiguously mapped reads.

transcript1 has 75 unambiguously mapped reads and transcript2 has 25 unambiguously mapped reads.

So kallisto will split the read fairly according to the unambiguously mapped reads, like 0.75 to transcript1 and 0.25 to transcript2 ?

ADD REPLY • link 6.1 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Do you get the same answer if you re-run kallisto for cond B?

ADD REPLY • link 6.1 years ago by GenoMax 141k

0

Entering edit mode

I forgot to mention that all the counts above are from DESeq2 counts, so after the sizeFactor normalization. That is not the raw counts.

I retrieve the raw counts for condition B1 below. I re-run kallisto on B1 and I got the same results.

Condition B1

Kallisto

Gm4737

ENSMUST00000059524.6 : 1105.28

Gm9826

ENSMUST00000029124.7 : 0

Ahcyl1

ENSMUST00000029490.14 : 3335.94
ENSMUST00000153623.1 : 48.3669
ENSMUST00000138091.7 : 401.973
ENSMUST00000153530.1 : 560.809
ENSMUST00000138116.1 : 58.8604
ENSMUST00000144864.7 : 4.40848
ENSMUST00000137583.1 : 15.4503
ENSMUST00000151935.1 : 58.0011

FeatureCounts

Gm4737

ENSMUSG00000048087.6 : 666

Gm9826

ENSMUSG00000048538.7 : 81

Ahcyl1

ENSMUSG00000027893.14 : 3942

Hope this will help !

ADD REPLY • link 6.1 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Do you know how to deal with this kind of pseudogenes in kallisto ? Gm4737 counts in featureCounts are broadly equal through my conditions. i was expected more or less the same thing using kallisto, even if (like said above) these are different methodologies. What would you do in this case ? "Silence" pseudogenes, keep them in DE analysis...?

ADD REPLY • link 6.1 years ago by Bastien Hervé 5.3k