Expression data from cufflinks merged gtf
5.4 years ago
ddowlin ▴ 70

Hi all,

I have assembled six transcriptomes with hisat2/stringtie. I merged the resulting gtf files into a single merged gtf.

For many transcripts in the merged gtf a transcript_id starting with 'MSTRG' was created. My understanding is that the MSTRG ids are assigned to transcripts in the merged gtf which may have different ids in the individual unmerged gtfs. This could be because there is no reference id for this transcript.

I am now interested in looking at the expression levels (TPM) of certain transcripts in the individual unmerged samples. However, I am having trouble matching the MSTRG id in the merged gtf to the corresponding ids in the unmerged gtfs.
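For reference, pulling the ids and TPM out of a GTF attribute column might look like the minimal sketch below. The attribute string, the `STRG.25` ids, and the helper name `parse_attributes` are made up for illustration; only the `key "value";` layout and the `MSTRG` prefix come from the files described here.

```python
import re

# Matches key "value" pairs in a GTF attribute column.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

def parse_attributes(field):
    """Return the attribute column of one GTF record as a dict (hypothetical helper)."""
    return dict(ATTR_RE.findall(field))

# Hypothetical attribute column from a transcript line in an unmerged GTF.
example = 'gene_id "STRG.25"; transcript_id "STRG.25.1"; TPM "3.0";'

attrs = parse_attributes(example)
tpm = float(attrs["TPM"])
# True only for ids that were assigned during the merge step.
is_merged_id = attrs["transcript_id"].startswith("MSTRG")

print(attrs["transcript_id"], tpm, is_merged_id)
```

The same helper could be applied to the merged GTF, where the transcript_id would start with `MSTRG` for the transcripts in question.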

I attempted to solve this problem using bedtools intersect to get the overlapping coordinates in the merged gtf with one of the unmerged gtfs. This allows me to map the MSTRG id to the unmerged id.
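The overlap logic can be sketched in plain Python as follows. This is only an illustration of the idea, not how bedtools is implemented. It assumes all records are on the same chromosome and strand, uses closed intervals, and takes its coordinates from the simplified example in this post. The function name `map_overlaps` is hypothetical.

```python
from collections import defaultdict

def map_overlaps(merged, unmerged):
    """Map each merged gene_id to the unmerged gene_ids whose intervals overlap it.

    Records are (start, end, gene_id) tuples; same chromosome and strand assumed.
    """
    mapping = defaultdict(list)
    for m_start, m_end, m_id in merged:
        for u_start, u_end, u_id in unmerged:
            # Two closed intervals overlap if each starts before the other ends.
            if u_start <= m_end and m_start <= u_end:
                mapping[m_id].append(u_id)
    return dict(mapping)

merged = [(19883, 26517, "MSTRG.34")]
unmerged = [(22942, 24454, "25"), (19883, 22800, "26"), (24624, 26412, "27")]

mapping = map_overlaps(merged, unmerged)
print(mapping)  # {'MSTRG.34': ['25', '26', '27']}
```

The nested loop is quadratic, which is fine for a sketch but would be slow on a whole annotation.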

However, I now have a new problem: in some cases a single MSTRG id is assigned to multiple unmerged ids. See below for a simplified example:

22942    24454    gene_id "25"    TPM "3"    19883    26517    gene_id "MSTRG.34"
19883    22800    gene_id "26"    TPM "5"    19883    26517    gene_id "MSTRG.34"
24624    26412    gene_id "27"    TPM "5"    19883    26517    gene_id "MSTRG.34"

My questions are:

  1. Why has stringtie merged these multiple transcripts into a single transcript in the merged gtf?
  2. Can I treat the TPM values as referring to a single transcript (i.e. the MSTRG id), and if so, what is the best way to do this?
    • Get the mean TPM per gene_id
    • Sum the TPM values per gene_id

Many thanks.

stringtie RNA-Seq bedtools gtf • 1.1k views