How to add tss_id and p_id in an Ensembl GTF (or any GTF other than that generated by cufflinks)
9.7 years ago
komal.rathi ★ 4.1k

Hi everyone,

I am working with mm10 data and using the Ensembl release 75 GTF for GRCm38. When you use any GTF other than cufflinks' merged.gtf, cuffdiff needs the tss_id and p_id attributes to be present for its differential splicing, promoter-use, and CDS tests. I am using the following command to add tss_id and p_id to my Ensembl GTF:

cuffcompare \
  -o cuffcmp \
  -C -G \
  -r Mus_musculus.GRCm38.75.protein_linc.gtf \
  -s mm10.fa \
  Mus_musculus.GRCm38.75.protein_linc.gtf

To check whether I was doing it correctly, I checked the entries for a particular gene in both the input and output gtfs.

The 'gene' entry for Xkr4 in the original GTF looks like this:

chr1    protein_coding    gene    3205901    3671498    .    -    .    gene_id "ENSMUSG00000051951"; gene_name "Xkr4"; gene_source "ensembl_havana"; gene_biotype "protein_coding";

And, these are the entries corresponding to the above coordinates in the output GTF cuffcmp.combined.gtf:

chr1    processed_transcript    exon    3205901 3207317 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002060"; exon_number "1"; gene_name "Xkr4"; oId "ENSMUST00000162897"; nearest_ref "ENSMUST00000162897"; class_code "="; tss_id "TSS1356";
chr1    processed_transcript    exon    3213609 3216344 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002060"; exon_number "2"; gene_name "Xkr4"; oId "ENSMUST00000162897"; nearest_ref "ENSMUST00000162897"; class_code "="; tss_id "TSS1356";
chr1    processed_transcript    exon    3206523 3207317 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002061"; exon_number "1"; gene_name "Xkr4"; oId "ENSMUST00000159265"; nearest_ref "ENSMUST00000159265"; class_code "="; tss_id "TSS1357";
chr1    processed_transcript    exon    3213439 3215632 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002061"; exon_number "2"; gene_name "Xkr4"; oId "ENSMUST00000159265"; nearest_ref "ENSMUST00000159265"; class_code "="; tss_id "TSS1357";
chr1    protein_coding  exon    3214482 3216968 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002062"; exon_number "1"; gene_name "Xkr4"; oId "ENSMUST00000070533"; nearest_ref "ENSMUST00000070533"; class_code "="; tss_id "TSS1358"; p_id "P1235";
chr1    protein_coding  exon    3421702 3421901 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002062"; exon_number "2"; gene_name "Xkr4"; oId "ENSMUST00000070533"; nearest_ref "ENSMUST00000070533"; class_code "="; tss_id "TSS1358"; p_id "P1235";
chr1    protein_coding  exon    3670552 3671498 .       -       .       gene_id "XLOC_000653"; transcript_id "TCONS_00002062"; exon_number "3"; gene_name "Xkr4"; oId "ENSMUST00000070533"; nearest_ref "ENSMUST00000070533"; class_code "="; tss_id "TSS1358"; p_id "P1235";

In the output, the gene_id field has XLOC ids instead of Ensembl gene IDs, and the transcript_id field has TCONS ids (the original Ensembl transcript ID is kept in oId). Can I fix this to have Ensembl IDs instead? Is there a better way to add tss_id and p_id to an Ensembl GTF?
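One possible way to put the Ensembl IDs back is to post-process cuffcmp.combined.gtf. The sketch below is only an illustration: it assumes tab-separated GTF lines, that every record in the combined GTF carries an oId attribute holding the original Ensembl transcript ID (as in the output shown above), and that the reference GTF has gene_id and transcript_id on its transcript-level records. The function names transcript_to_gene and restore_ids are made up for this example.

```python
import re


def transcript_to_gene(ref_lines):
    """Map Ensembl transcript_id -> gene_id using the original (reference) GTF."""
    mapping = {}
    for line in ref_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9:
            continue
        tx = re.search(r'transcript_id "([^"]+)"', cols[8])
        gene = re.search(r'gene_id "([^"]+)"', cols[8])
        # 'gene' records have no transcript_id and are skipped here
        if tx and gene:
            mapping[tx.group(1)] = gene.group(1)
    return mapping


def restore_ids(combined_lines, tx2gene):
    """Yield combined.gtf lines with gene_id/transcript_id set back to Ensembl IDs.

    Assumption: oId holds the Ensembl transcript ID. Lines whose oId is
    missing from tx2gene are passed through unchanged.
    """
    for line in combined_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        oid = re.search(r'oId "([^"]+)"', cols[8])
        if oid and oid.group(1) in tx2gene:
            ens_tx = oid.group(1)
            # Replace only the two ID attributes; tss_id, p_id, etc. stay as they are
            cols[8] = re.sub(r'gene_id "[^"]*";',
                             f'gene_id "{tx2gene[ens_tx]}";', cols[8], count=1)
            cols[8] = re.sub(r'transcript_id "[^"]*";',
                             f'transcript_id "{ens_tx}";', cols[8], count=1)
        yield "\t".join(cols)
```

To use it, open Mus_musculus.GRCm38.75.protein_linc.gtf and cuffcmp.combined.gtf, pass the first to transcript_to_gene, and write the lines from restore_ids to a new file. One caveat: the tss_id values were assigned within XLOC loci, and an XLOC can span more than one Ensembl gene, so it is worth checking the result before handing it to cuffdiff.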

8.5 years ago
Malcolm.Cook ★ 1.5k

I have developed an Rscript, cuffdiff_gtf_attributes, which can provide the additional attributes p_id and tss_id that cuffdiff requires to perform all of its differential splicing/coding/expression contrasts. I have tested it with Ensembl GTFs.
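The script itself is not reproduced in this thread. Purely to illustrate what a tss_id represents, here is a rough Python sketch (not the Rscript above, and not necessarily how cuffcompare does it) that gives transcripts of the same gene the same tss_id when they share a strand-aware start position. Grouping by the exact start coordinate is an assumption, and assign_tss_ids is a hypothetical name. p_id, which groups transcripts by their coding sequence, is not covered.

```python
import re


def assign_tss_ids(gtf_lines):
    """Return {transcript_id: tss_id}, grouping by (gene, chrom, strand, TSS).

    Assumption: the TSS is the lowest exon start on '+' and the highest
    exon end on '-'; transcripts with the same TSS in the same gene share an id.
    """
    # transcript_id -> [gene_id, chrom, strand, min_start, max_end]
    spans = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        tx = re.search(r'transcript_id "([^"]+)"', cols[8])
        gene = re.search(r'gene_id "([^"]+)"', cols[8])
        if not tx or not gene:
            continue
        start, end = int(cols[3]), int(cols[4])
        rec = spans.setdefault(
            tx.group(1), [gene.group(1), cols[0], cols[6], start, end])
        rec[3] = min(rec[3], start)
        rec[4] = max(rec[4], end)

    ids_by_key = {}  # (gene, chrom, strand, tss position) -> tss_id
    tss_of = {}      # transcript_id -> tss_id
    for tx, (gene, chrom, strand, start, end) in spans.items():
        key = (gene, chrom, strand, end if strand == "-" else start)
        if key not in ids_by_key:
            ids_by_key[key] = f"TSS{len(ids_by_key) + 1}"
        tss_of[tx] = ids_by_key[key]
    return tss_of
```

For the three Xkr4 transcripts quoted in the question (minus strand, highest exon ends at 3216344, 3215632 and 3671498), this grouping yields three different ids, which matches the three distinct tss_id values in the cuffcompare output.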

Thanks for the script, works great!

8.7 years ago

Hi,

You can download the correct gtf file from here I believe:

https://ccb.jhu.edu/software/tophat/igenomes.shtml
