Annotation for gene-level analysis after stringtie
3.7 years ago
atorson • 0

Hi all,

Forgive me if this question has been posted before (I'm new to this forum). In the past, when doing transcript-level RNA-seq analyses, I have used the gffread utility to pull transcript sequences for annotation and downstream functional analyses.

i.e.

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

However, I am currently working on a gene-level analysis and am having issues doing something similar - especially in regard to evaluating novel loci in my Stringtie assemblies. The way I see it, I have one of two options:

1) Use my RefSeq IDs for non-novel loci and annotate novel loci separately and then merge these annotation lists prior to functional analysis. (If I go this route, I still need to generate a FASTA with the novel loci anyway)

or

2) Pull all gene/loci sequences and re-annotate everything

Is there a (relatively) simple work-around for this?

RNA-Seq stringtie tuxedo annotation • 2.1k views

to which do you want a work-around?

"Pull all gene/loci sequences" or "re-annotate everything" or both?


Possibly both, since I'm not sure which one will work better for me when I'm annotating downstream. My RefSeq GTF had 14,6XX genes and my new merged assembly (using TACO) has about 24,000, so I've got a lot of novel loci to deal with. My plan now is to carry the original RefSeq IDs through for the reference loci (primarily because I can more easily do functional annotation with those IDs) and then do protein-coding potential prediction and annotation of the novel loci. It seems to me that a work-around to "pull all gene/loci sequences", then filter for the novel loci and annotate just those, could be an effective way to go about this. Thoughts?
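One possible way to do the "filter for the novel loci" step, sketched here as an assumption rather than something used in this thread: compare the merged assembly to the reference with gffcompare and keep the transcripts it labels with class code "u" (intergenic, no overlap with a reference transcript). The function name and file names below are hypothetical.

```shell
# Sketch only: list transcript IDs that gffcompare marked with class code "u".
# Input is assumed to be the <prefix>.annotated.gtf produced by something like:
#   gffcompare -r reference.gtf -o <prefix> merged.gtf
novel_transcript_ids() {
    awk -F'\t' '
        # transcript lines carry the class_code attribute in column 9
        $3 == "transcript" && $9 ~ /class_code "u"/ {
            if (match($9, /transcript_id "[^"]+"/))
                # skip the 15-character prefix transcript_id " and the closing quote
                print substr($9, RSTART + 15, RLENGTH - 16)
        }
    ' "$1" | sort -u
}
```

The resulting ID list can then be used to subset the GTF before pulling sequences; whether "u" alone is the right definition of "novel" depends on the analysis.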


that plan does sound fair indeed.

to 'annotate' the new transcripts you could submit them to some online resources that look for protein and function in a (semi) automated way. One that comes to mind is Trapid, but there are others around for sure.


Yeah, that makes sense (and I'll have a look at Trapid!). However, my primary issue here is that I can't figure out how to construct a FASTA file for the gene/loci sequences. Pulling the transcript sequences with gffread is straightforward, but gene/loci on the other hand...


normally extracting the transcripts should be enough (if it is for annotation that is), no need to get the complete loci

if you have a gff/gtf file of the novel transcripts you can extract their sequences with, for instance, gffread, bedtools (getfasta), or the AGAT package


If you provide StringTie with the reference annotation when you assemble, then it will do your option 1 for you. That is, known transcripts will get their known transcript IDs, and novel transcripts of known genes will get their known gene ID. Only novel transcripts of novel genes will get a novel gene ID.


That's correct, I can easily ID which genes are novel when I include a reference annotation in my pipeline, but my issue is actually getting the gene sequences (specifically the novel loci) as a FASTA, so I can do the annotation downstream.


What do you mean by "annotate"? Annotate in what way? For what purpose?

If you just want the sequences of the novel genes, I'd do something like:

# -E enables "+"; StringTie merge IDs look like MSTRG.123, so match the literal dot
grep -E 'gene_id "MSTRG\.[0-9]+"' transcripts.gtf > novel_genes.gtf
gffread -w novel_transcripts.fa -g /path/to/genome.fa novel_genes.gtf

Is the idea that you want to try to attach something like domain or GO annotations to the novel genes?
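If whole gene/locus sequences are wanted rather than spliced transcripts, one rough sketch is to collapse each gene_id to a single interval (smallest start, largest end over its transcript and exon lines) and hand the resulting BED file to bedtools getfasta. The function name and file names are hypothetical, and this assumes each gene_id sits on one chromosome and one strand.

```shell
# Sketch only: collapse a GTF to one BED6 interval per gene_id.
# Assumes every transcript/exon line has a gene_id attribute in column 9.
gtf_to_gene_bed() {
    awk -F'\t' '
        $0 !~ /^#/ && ($3 == "transcript" || $3 == "exon") {
            if (match($9, /gene_id "[^"]+"/)) {
                # skip the 9-character prefix gene_id " and the closing quote
                id = substr($9, RSTART + 9, RLENGTH - 10)
                if (!(id in chr)) {
                    chr[id] = $1; lo[id] = $4 + 0; hi[id] = $5 + 0; strand[id] = $7
                }
                if ($4 + 0 < lo[id]) lo[id] = $4 + 0
                if ($5 + 0 > hi[id]) hi[id] = $5 + 0
            }
        }
        END {
            # GTF is 1-based inclusive, BED is 0-based half-open, so start - 1
            for (id in chr)
                printf "%s\t%d\t%d\t%s\t.\t%s\n", chr[id], lo[id] - 1, hi[id], id, strand[id]
        }
    ' "$1" | sort -k1,1 -k2,2n
}

# Possible use (requires bedtools; not run here):
#   gtf_to_gene_bed novel_genes.gtf > novel_loci.bed
#   bedtools getfasta -fi /path/to/genome.fa -bed novel_loci.bed -s -name -fo novel_loci.fa
```

Note that the locus sequence includes introns, so for protein or domain annotation the transcript FASTA from gffread is usually the more useful input. Loci with an unknown strand (".") may also need separate handling before using -s.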
