Stringtie output files
15 months ago
chipolino • 40

I am new to StringTie, and this question is probably very simple, but I still don't get it. I have sorted BAM files (HISAT2 output, genome v19), and here is my StringTie command (v1.3.4):

stringtie hisat2_work/hisat2/alignments.sorted.bam -o stringtie_results/transcripts.gtf -G genes.GRCh37.gtf --rf -A stringtie_results/gene_abund.tab

As a result I have two output files: gene abundances (gene_abund.tab) and a transcript annotation file (transcripts.gtf). For example, if I open gene_abund.tab, I see this line:

Gene ID Gene Name   Reference   Strand  Start   End Coverage    FPKM    TPM
ENSG00000223972 DDX11L1 1   +   11869   14412   0.180934    0.129907    0.341143

But if I search transcripts.gtf for the gene name DDX11L1 (or its gene ID, ENSG00000223972), I don't find it; it is absent. At the same time, I can find other genes from gene_abund.tab in transcripts.gtf, for example:

line in gene_abund.tab:

ENSG00000227232 WASH7P  1   -   14363   29806   16.906973   12.345821   32.420803

corresponding line in transcripts.gtf:

StringTie   transcript  14363   29370   1000    -   .   gene_id "STRG.2"; transcript_id "STRG.2.2"; reference_id "ENST00000423562"; ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; cov "1.478912"; FPKM "1.061831"; TPM "2.788425";

What could be the problem here? Why are some genes from gene_abund.tab missing from my transcripts.gtf file?
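To see exactly which genes are affected, the two files can be compared programmatically. The following is a minimal Python sketch, not part of StringTie: the function names `ref_gene_ids` and `missing_genes` are hypothetical, and the sample rows are the ones quoted in this post (real files would be read from disk). It assumes that annotated transcripts carry a `ref_gene_id` attribute, as in the GTF line shown above.

```python
import csv
import io
import re

# Sample rows copied from the post, tab-separated as in gene_abund.tab.
GENE_ABUND = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "ENSG00000223972\tDDX11L1\t1\t+\t11869\t14412\t0.180934\t0.129907\t0.341143\n"
    "ENSG00000227232\tWASH7P\t1\t-\t14363\t29806\t16.906973\t12.345821\t32.420803\n"
)

# The transcript line exactly as quoted in the post.
TRANSCRIPTS_GTF = (
    'StringTie\ttranscript\t14363\t29370\t1000\t-\t.\t'
    'gene_id "STRG.2"; transcript_id "STRG.2.2"; '
    'reference_id "ENST00000423562"; ref_gene_id "ENSG00000227232"; '
    'ref_gene_name "WASH7P"; cov "1.478912"; FPKM "1.061831"; TPM "2.788425";\n'
)

REF_GENE_ID = re.compile(r'ref_gene_id "([^"]+)"')


def ref_gene_ids(gtf_text):
    """Collect every ref_gene_id attribute value found in the GTF text."""
    return set(REF_GENE_ID.findall(gtf_text))


def missing_genes(abund_text, gtf_text):
    """Return (gene_id, gene_name, coverage) for table rows absent from the GTF."""
    seen = ref_gene_ids(gtf_text)
    reader = csv.DictReader(io.StringIO(abund_text), delimiter="\t")
    return [
        (row["Gene ID"], row["Gene Name"], float(row["Coverage"]))
        for row in reader
        if row["Gene ID"] not in seen
    ]


print(missing_genes(GENE_ABUND, TRANSCRIPTS_GTF))
```

With the two sample rows, only DDX11L1 is reported as missing. Listing the coverage next to each missing gene makes it easy to check whether the absent genes share a pattern such as very low coverage.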


Hello and welcome to Biostars,

To show the commands you use and the contents of files, please use the code button (the one with 101 010). This makes your post much more readable.

This time I did it for you.

fin swimmer

13 months ago
Republic of Ireland

The gene that was not included (DDX11L1) has a coverage of about 0.18, which falls below the threshold. It is virtually not expressed at all.

Modify the -C and -c parameters of StringTie:

-C <cov_refs.gtf> StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5

Kevin
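As a rough screen, the Coverage column of gene_abund.tab can be compared against the -c value quoted above. This is a minimal Python sketch under stated assumptions: `below_threshold` is a hypothetical helper, the rows are the two quoted in the question, and the gene-level coverage in the table is only a proxy, since StringTie applies -c to individual predicted transcripts.

```python
import csv
import io

DEFAULT_MIN_COVERAGE = 2.5  # default for -c as quoted in the option text above

# Sample rows copied from the question, tab-separated as in gene_abund.tab.
GENE_ABUND = (
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM\n"
    "ENSG00000223972\tDDX11L1\t1\t+\t11869\t14412\t0.180934\t0.129907\t0.341143\n"
    "ENSG00000227232\tWASH7P\t1\t-\t14363\t29806\t16.906973\t12.345821\t32.420803\n"
)


def below_threshold(abund_text, min_cov=DEFAULT_MIN_COVERAGE):
    """Return (gene_name, coverage) for rows whose Coverage is under min_cov."""
    reader = csv.DictReader(io.StringIO(abund_text), delimiter="\t")
    flagged = []
    for row in reader:
        coverage = float(row["Coverage"])
        if coverage < min_cov:
            flagged.append((row["Gene Name"], coverage))
    return flagged


print(below_threshold(GENE_ABUND))
```

With the sample rows, only DDX11L1 is flagged. Passing a smaller `min_cov` shows which genes would clear a lower cutoff before rerunning StringTie with a different -c.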


I should additionally point out that DDX11L1 is a pseudogene. So it makes sense that it may have minimal expression if it lacks a promoter sequence or TSS from which transcription at a meaningful level could occur.
