Question

cufflinks with reference annotation, FPKM for a gene


8.6 years ago

tonja.r ▴ 600

I ran Cufflinks (2.2.1) with an exon-only annotation (I do not want to count intronic reads) and got the gene and isoform tracking output shown below. The FPKM for a gene does not seem to be the sum of the FPKMs of its isoforms, so I am wondering how the gene FPKM is calculated.

I also ran Cufflinks with a transcript-only annotation and got totally different results.

There is little information about running Cufflinks with a reference annotation. As far as I understand, Cufflinks counts reads only in the regions specified in the annotation file, and since the total length of the exons belonging to one transcript differs from the total length of the transcript (because of the introns), the coverage and FPKM will also differ.

Only exon annotation:

Isoform tracks:

tracking_id    class_code    nearest_ref_id    gene_id    gene_short_name    tss_id    locus    length    coverage    FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
ENSMUST00000162897.1    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3195981-3206425    4153    23.0161    2.7933    2.46392    3.12269    OK
ENSMUST00000159265.1    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3196603-3205713    2989    22.8029    2.76743    2.38783    3.14703    OK
ENSMUST00000070533.4    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3204562-3661579    3634    32.277    3.91723    3.5261    4.30836    OK

gene tracks:

tracking_id    class_code    nearest_ref_id    gene_id    gene_short_name    tss_id    locus    length    coverage    FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
ENSMUSG00000051951.5    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3195981-3661579    -    -    9.47796    8.91496    10.041    OK

With only transcript annotation:

isoform track:

tracking_id    class_code    nearest_ref_id    gene_id    gene_short_name    tss_id    locus    length    coverage    FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
ENSMUST00000162897.1    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3195981-3206425    10444    21.4978    2.58177    2.39326    2.77028    OK
ENSMUST00000159265.1    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3196603-3205713    9110    5.17168e-06    6.2109e-07    0    0.00526385    OK
ENSMUST00000070533.4    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3204562-3661579    457017    0.188298    0.0226136    0.0202322    0.024995    OK

gene track:

tracking_id    class_code    nearest_ref_id    gene_id    gene_short_name    tss_id    locus    length    coverage    FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
ENSMUSG00000051951.5    -    -    ENSMUSG00000051951.5    Xkr4    -    chr1:3195981-3661579    -    -    2.60439    2.41573    2.79304    OK

RNA-Seq • 3.4k views


Ram · Accepted Answer · 2015-09-25

Hi again, glad you could make it work!

"FPKM for a gene is not the sum of FPKMs of the isoforms, so I am asking myself how they calculated the FPKM for a gene."

With the "only exon" annotation, the FPKM for the gene is the sum of the FPKMs of its isoforms: 2.7933 + 2.76743 + 3.91723 = 9.47796
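If you want to check this yourself, here is a minimal Python sketch that sums the isoform FPKMs and compares them with the gene FPKM. The numbers are copied by hand from the tables in your question; the dictionary layout and the function name `isoform_sum_gap` are just my own choices for illustration, not anything Cufflinks provides.

```python
# Values copied by hand from the tracking tables in the question.
runs = {
    "exon-only annotation": {
        "isoform_fpkm": [2.7933, 2.76743, 3.91723],
        "gene_fpkm": 9.47796,
    },
    "transcript-only annotation": {
        "isoform_fpkm": [2.58177, 6.2109e-07, 0.0226136],
        "gene_fpkm": 2.60439,
    },
}


def isoform_sum_gap(run):
    """Absolute difference between the summed isoform FPKMs and the gene FPKM."""
    return abs(sum(run["isoform_fpkm"]) - run["gene_fpkm"])


for name, run in runs.items():
    total = sum(run["isoform_fpkm"])
    print(f"{name}: sum of isoforms = {total:.5f}, "
          f"gene = {run['gene_fpkm']:.5f}, gap = {isoform_sum_gap(run):.1e}")
```

With the values as printed in the question, the gap should be at the level of the last reported digit in both runs, which looks like rounding in the output files rather than a real discrepancy.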

Providing an "only transcripts" annotation seems really odd to me because, as you said, Cufflinks won't know where the introns are. You can see this in your second table, where the reported length of each isoform is the full genomic span of its locus (10444, 9110 and 457017) instead of the summed exon length. This is probably why you get lower FPKM values for the second and third isoforms. To answer your question, Cufflinks computes FPKMs with a probabilistic method with read length correction; read their paper if you want the details. Note that in the transcript-only run the isoform FPKMs also add up to the gene FPKM, apart from rounding in the last digit: 2.58177 + 6.2109e-07 + 0.0226136 ≈ 2.60438, against a reported 2.60439. Anyway, you probably shouldn't use the transcript-only annotation.
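To illustrate the length effect, here is a rough sketch using the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). This is not what Cufflinks actually runs, since its estimates come from a statistical model; the fragment count and library size below are made-up numbers, and only the two lengths for ENSMUST00000070533.4 come from your tables.

```python
def simple_fpkm(fragments, length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical values, chosen only for illustration.
FRAGMENTS = 500
TOTAL_MAPPED = 30_000_000

# Lengths reported for ENSMUST00000070533.4 in the two runs.
EXON_LENGTH = 3634      # exon-only annotation
SPAN_LENGTH = 457017    # transcript-only annotation (full genomic span)

fpkm_exons = simple_fpkm(FRAGMENTS, EXON_LENGTH, TOTAL_MAPPED)
fpkm_span = simple_fpkm(FRAGMENTS, SPAN_LENGTH, TOTAL_MAPPED)

# With the same fragment count, the ratio depends only on the two lengths.
fold_change = fpkm_exons / fpkm_span
print(f"exon length: {fpkm_exons:.3f}  genomic span: {fpkm_span:.5f}  "
      f"fold difference: {fold_change:.1f}")
```

The point is only that, for the same number of fragments, dividing by the genomic span instead of the exon length shrinks the FPKM by the ratio of the two lengths. The real difference between your two runs will not match this ratio exactly, because the assignment of reads to isoforms changes as well.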

Best,
Carlo