Question

Too low FPKM values from StringTie

0

Entering edit mode

5.2 years ago

SJ Basu ▴ 50

Hello People,

I am analyzing 20 samples RNA-Seq data of leukemia. To this end, I have used the Hisat2-StingTie2 pipeline to analyze the samples and arrive at FPKM counts for each gene. After I have run the complete process, I found that the FPKM values are way too low for calculating Differential gene expression and it seems there is no difference in expression across samples.

                B85         B86         B78         B89
       Minimum  0           0           0           0
First Quartile  0.004382    0.004093    0.019453    0.011406
        Median  0.150255    0.153044    0.167367    0.144218
Third Quartile  1.41        1.463101    1.180386    1.064904
       Maximum  120290.64   144817.40   116797.73   119615.86
      Skewness  91.670      97.440      86.229      93.291

These FPKM values were taken from StringTie file using -A options. These samples were processed using Illumina truseq stranded total mrna kit alongside ribo-depletion kit and was sequenced on NextSeq 550 with 2X75bp reads (avg 60mil reads per sample). Peculiar thing is Hisat2 have shown >90% mapping and there isn't even rRNA contamination. And on top, I have even mapped the reads to cDNA database which also showed >55% reads mapping.

The commands were as follows:

hisat2 -p 60 --dta -x hsa_GRCh38_index -1 sample_R1_pair.fastq.gz -2 sample_R2_pair.fastq.gz -S sample_map2genom.sam && samtools view -bSh -@ 45 sample_map2genom.sam | samtools sort -@ 45 > sample_map2genom.bam

stringtie ${i}_map2genom.bam -G Homo_sapiens.GRCh38.95.gtf -o sample_assembly_strg.gtf -p 60 -A ${i}_abundance.txt -B

This pipeline has worked great for me before but this particular output is totally puzzling me. Has anyone encountered this before ?? Please advice.

Thanks and regards

RNA-Seq stringtie new tuxedo • 3.0k views

ADD COMMENT • link 5.2 years ago by SJ Basu ▴ 50

score 1 · Answer 1 · 2019-02-12

1

Entering edit mode

5.2 years ago

shawn.w.foley ★ 1.3k

How many genes have >0, >1, or >10 FPKM. What type of samples are these (cell line, tissue, etc). And how many genes are in your genome?

It's not unreasonable to have an Ensembl annotation with >55,000 genes of which only ~8-15,000 genes are expressed. I'm wondering if annotated but unexpressed genes are hiding the true signal. This would be especially apparent in tissue culture cell lines.

ADD COMMENT • link 5.2 years ago by shawn.w.foley ★ 1.3k

0

Entering edit mode

Hi Shawn,

These are WBCs (leukemia) extracted from FFPE samples but rna quality was good.

For number of genes >0 is ~35k , >1 is ~15-18k and >10 is ~3-5k

"I'm wondering if annotated but unexpressed genes are hiding the true signal"...I didn't exactly get that....

ADD REPLY • link 5.2 years ago by SJ Basu ▴ 50

1

Entering edit mode

Poor wording on my part. I think your data looks reasonable, 15-18k with >1FPKM is in the realm of what I normally see. I think the fact that you have ~75% of genes with <1 FPKM is because most of them just aren't expressed. I would set an expression threshold and then analyze everything above that threshold in your differential expression.

I normally generate a density plot in R, it'll often show a bimodal curve, then I determine my cutoff for expression empirically.

ADD REPLY • link 5.2 years ago by shawn.w.foley ★ 1.3k

0

Entering edit mode

Actually the problem is I don't see a "CURVE", its just a really long, sharp PEAK at ~0 FPKM value....it looks I might as well take a cut-off of FPKM value of 2 and still would retain more than 25K genes.

Another thing is, which FPKM do you consider ? The one that is given by -A option or the one that comes out from t_data.ctab....

ADD REPLY • link 5.2 years ago by SJ Basu ▴ 50

1

Entering edit mode

Try making a density plot of the log10(FPKM), that should show a more appropriate curve. And I'm not sure which output to recommend, I haven't used HiSat before, I tend to stick to STAR/HTSeq/DESeq2 or more recently Salmon/DESeq2.

ADD REPLY • link 5.2 years ago by shawn.w.foley ★ 1.3k

1

Entering edit mode

All RNA-seq count data is shifted toward zero, so, that is expected. RNA-seq counts follow a negative binomial distribution. I don't believe that production of FPKM expression values results in an overall change from this data distribution. A transformation of normalised counts via, e.g., variance-stabilisation, logging, Z-scaling, or a combination of these, will alter the distribution.

Also keep in mind that FPKM units are not suitable for differential expression comparisons.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

Thanks for reassurance and the information. As for the FPKM units, so which units can reliably be used for differential expression comparisons ?? TPM ??

ADD REPLY • link 5.2 years ago by SJ Basu ▴ 50

2

Entering edit mode

Hey, if you just follow the recommendation by shawn (Salmon --> DESeq2), then you'll have data that is suitable for differential expression. Salmon will require input FASTQ files, I believe, and bypasses the production of a BAM file.

The other method, STAR --> HTSeq --> DESeq2, will produce a BAM file, with raw counts then derived by HTSeq.

Note, that normalisation and differential expression analysis occurs in DESeq2.

As you have already used StringTie, you can actually use a function, prepDE.py, which conveniently converts your HISAT2 / StringTie dataset to raw counts that are suitable for DESeq2: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq

Criticism of FPKM for differential expression analysis is out there - just do a search. You will not find many voices actively promoting it for such analyses (although publications do exist).

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k