Question

StringTie output files


5.8 years ago

chipolino ▴ 150

I am new to StringTie, and this question is probably very simple, but I still don't get it. I have my sorted BAM files (HISAT2 output, genome v19), and here is my StringTie command (v1.3.4):

stringtie hisat2_work/hisat2/alignments.sorted.bam -o stringtie_results/transcripts.gtf -G genes.GRCh37.gtf --rf -A stringtie_results/gene_abund.tab

As a result I have two output files: gene abundances (gene_abund.tab) and transcript annotation file (transcripts.gtf). For example, if I open gene_abund.tab, I will see this line:

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000223972 DDX11L1 1   +   11869   14412   0.180934    0.129907    0.341143

But if I search transcripts.gtf for the gene name DDX11L1 (or its gene ID, ENSG00000223972), I don't find it; it is absent. At the same time, I can find other genes from gene_abund.tab in transcripts.gtf, for example:

line in gene_abund.tab:

ENSG00000227232 WASH7P  1   -   14363   29806   16.906973   12.345821   32.420803

corresponding line in transcripts.gtf:

StringTie   transcript  14363   29370   1000    -   .   gene_id "STRG.2"; transcript_id "STRG.2.2"; reference_id "ENST00000423562"; ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; cov "1.478912"; FPKM "1.061831"; TPM "2.788425";

What could be the problem here? Why are some genes from gene_abund.tab missing from my transcripts.gtf file?

RNA-Seq Stringtie GTF • 6.6k views


Hello and welcome to Biostars,

To show the commands you use and the contents of your files, please use the code button (the one with 101 010). This makes your post much more readable.

This time I did it for you.

fin swimmer

ADD REPLY • link 5.8 years ago by finswimmer 16k

score 1 · Answer 1 · 2018-06-17


5.8 years ago

Kevin Blighe 87k

The gene that was not included (DDX11L1, coverage 0.18 in gene_abund.tab) has coverage that falls below the threshold. It is virtually not expressed at all.
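To check this across all genes rather than one, here is a minimal Python sketch that lists the genes present in gene_abund.tab but absent from transcripts.gtf, together with their coverage. It assumes the tab-separated header shown in the question ("Gene ID", "Coverage") and that the GTF attributes carry `gene_id` / `ref_gene_id` as in the WASH7P line above. The function names are made up for illustration and are not part of StringTie.

```python
import csv
import re

# Matches both gene_id "..." and ref_gene_id "..." (the latter contains the former).
GENE_ID_ATTR = re.compile(r'gene_id "([^"]+)"')


def load_gene_coverage(abund_path):
    """Map gene ID -> coverage from a StringTie -A table (tab-separated, with header)."""
    coverage = {}
    with open(abund_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            coverage[row["Gene ID"]] = float(row["Coverage"])
    return coverage


def load_gtf_gene_ids(gtf_path):
    """Collect every gene_id and ref_gene_id value found in the GTF attributes."""
    gene_ids = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):  # skip header/comment lines
                continue
            gene_ids.update(GENE_ID_ATTR.findall(line))
    return gene_ids


def missing_genes(abund_path, gtf_path):
    """Return (gene_id, coverage) pairs absent from the GTF, lowest coverage first."""
    coverage = load_gene_coverage(abund_path)
    present = load_gtf_gene_ids(gtf_path)
    absent = [(gene, cov) for gene, cov in coverage.items() if gene not in present]
    return sorted(absent, key=lambda pair: pair[1])
```

Calling `missing_genes("stringtie_results/gene_abund.tab", "stringtie_results/transcripts.gtf")` and looking at the coverage column should show whether the absent genes are the low-coverage ones.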

Adjust the -C and -c parameters of StringTie:

-C <cov_refs.gtf> StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5
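As a rough illustration of how these two options could be added to the command from the question, the sketch below assembles the argument list in Python. The paths are the ones from the question. The `-c` value of 0.1 and the `cov_refs.gtf` output path are placeholders chosen for the example, not recommendations, and `build_stringtie_cmd` is a made-up helper name.

```python
import shlex


def build_stringtie_cmd(bam, out_gtf, ref_gtf, abund_tab,
                        min_cov=None, cov_refs_gtf=None, strand_flag="--rf"):
    """Assemble a StringTie command line as a list of arguments."""
    cmd = ["stringtie", bam, "-o", out_gtf, "-G", ref_gtf, strand_flag, "-A", abund_tab]
    if min_cov is not None:
        cmd += ["-c", str(min_cov)]   # minimum read coverage for predicted transcripts
    if cov_refs_gtf is not None:
        cmd += ["-C", cov_refs_gtf]   # fully covered reference transcripts (requires -G)
    return cmd


cmd = build_stringtie_cmd(
    "hisat2_work/hisat2/alignments.sorted.bam",
    "stringtie_results/transcripts.gtf",
    "genes.GRCh37.gtf",
    "stringtie_results/gene_abund.tab",
    min_cov=0.1,                                     # placeholder threshold, an assumption
    cov_refs_gtf="stringtie_results/cov_refs.gtf",   # hypothetical output path
)
print(shlex.join(cmd))
# To run it (requires StringTie on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Lowering `-c` is likely to let more weakly supported transcripts into transcripts.gtf, so the threshold is a trade-off between completeness and noise.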

Kevin


I should also point out that DDX11L1 is a pseudogene. It therefore makes sense that it shows minimal expression: if it lacks a promoter sequence or TSS, transcription at a meaningful level would not be expected.

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k