Our lab is working on alternative splicing events in some non-model organisms. However, there are few mRNAs (transcripts) in the annotation files, in most cases 1 transcript per gene. We thought this was because the genomes are poorly annotated, so we collected a lot of RNA-seq data and applied HISAT and StringTie to generate novel splice variants. There are a lot more transcripts in the output, reaching 2 transcripts per gene.
However, we are concerned about the new transcripts, because they extend beyond the gene regions in the annotation file: in some cases a whole exon lies outside the gene region, and in a few cases the whole transcript lies outside the gene region.
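To put numbers on the three cases above, something like the following could be used. It is only a minimal sketch: it assumes the exon and gene coordinates have already been parsed from the GTF files into `(start, end)` tuples (1-based, inclusive), and the function names and category labels are made up for illustration.

```python
def classify_exon(exon, gene):
    """Label one exon relative to the annotated gene span.

    exon, gene: (start, end) tuples, 1-based inclusive as in GTF.
    """
    e_start, e_end = exon
    g_start, g_end = gene
    if e_end < g_start or e_start > g_end:
        return "outside"   # no overlap with the gene region at all
    if e_start >= g_start and e_end <= g_end:
        return "inside"    # fully contained in the gene region
    return "partial"       # straddles a gene boundary


def classify_transcript(exons, gene):
    """Summarise a transcript from the labels of its exons."""
    labels = [classify_exon(e, gene) for e in exons]
    if all(label == "inside" for label in labels):
        return "within_gene"
    if all(label == "outside" for label in labels):
        return "fully_outside"
    if "outside" in labels:
        return "has_outside_exon"
    return "extends_boundary"  # only partial overhangs, no whole exon outside


gene = (1000, 5000)
print(classify_transcript([(900, 1200), (3000, 3200)], gene))
print(classify_transcript([(200, 400), (1000, 1200)], gene))
```

Counting how many new transcripts fall into each category would show how much of the output depends on exons outside the annotated gene regions.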
May I have your advice on other tools, or what do you think about the new transcripts? Do you think they are acceptable? Is there any handy tool to find new combinations of exons from RNA-seq data and a .gtf annotation file? Thank you very much!
Thank you very much for your detailed answer! What do you think about a whole new exon appearing before or after the annotated gene region? These actually make up a large portion of our output. I've checked, and some of them overlap a nearby gene while others are completely unannotated. If I removed them all, there would not be much left.
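The check described here (does a new exon overlap a nearby gene, or is it completely unannotated?) could be sketched as below. This is an illustration only: the `genes` dictionary, the tuple layout and the function name are assumptions, and the coordinates are invented.

```python
def overlapping_genes(exon, genes, same_strand_only=False):
    """Return the names of annotated genes that an exon overlaps.

    exon:  (chrom, start, end, strand), 1-based inclusive
    genes: dict mapping gene name -> (chrom, start, end, strand)
    """
    chrom, start, end, strand = exon
    hits = []
    for name, (g_chrom, g_start, g_end, g_strand) in genes.items():
        if g_chrom != chrom:
            continue
        if same_strand_only and g_strand != strand:
            continue
        # Two closed intervals overlap if each starts before the other ends.
        if start <= g_end and end >= g_start:
            hits.append(name)
    return sorted(hits)


genes = {
    "geneA": ("chr1", 1000, 5000, "+"),
    "geneB": ("chr1", 5500, 9000, "-"),
}
new_exon = ("chr1", 5400, 5600, "+")
print(overlapping_genes(new_exon, genes))                         # any strand
print(overlapping_genes(new_exon, genes, same_strand_only=True))  # same strand only
```

An empty list means the exon is in unannotated sequence. The `same_strand_only` switch is there because an overlap with a gene on the opposite strand is a different situation from an overlap on the same strand.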
I'd be careful with exons that overlap other genes. It's not that it's impossible (many human genes overlap), it's just that there are reasons why they might also be artefacts. Is your RNA-seq stranded? I'd also want to be careful with new exons 3' of the ORF: classically we think of exon boundaries more than 50bp after the stop codon as triggering nonsense-mediated decay. There is no reason you shouldn't have exons 5' of the ORF though. Many genes don't have their start codon in the first exon, or have alternate transcription start sites or whole alternate 5' exons.
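The 50bp rule of thumb can be checked per transcript roughly as follows. This is a sketch, not a definitive NMD predictor: it assumes the ORF is already known, that `exon_lengths` is given in 5' to 3' transcript order, and that `stop_end` is the number of transcript bases up to and including the last base of the stop codon. The function names and the default threshold of 50 are illustrative.

```python
def last_junction_distance(exon_lengths, stop_end):
    """Distance (nt) from the end of the stop codon to the last exon-exon junction.

    Returns None for single-exon transcripts, which have no junction.
    A negative value means the stop codon lies in the last exon.
    """
    if len(exon_lengths) < 2:
        return None
    # Number of transcript bases upstream of the last junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_end


def is_nmd_candidate(exon_lengths, stop_end, threshold=50):
    """Apply the rule of thumb: last junction more than `threshold` nt after the stop."""
    distance = last_junction_distance(exon_lengths, stop_end)
    return distance is not None and distance > threshold


# Annotated transcript: stop codon 30 nt upstream of the last junction.
print(is_nmd_candidate([200, 150, 300], stop_end=320))
# Same ORF with a new exon added at the 3' end: the last junction moves downstream.
print(is_nmd_candidate([200, 150, 300, 120], stop_end=320))
```

Transcripts flagged this way are not necessarily wrong, but a new 3' exon that turns an annotated transcript into a candidate is worth a closer look.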
We are from a bioinformatics lab, and the RNA-seq data were collected from different experiments in different databases; I think most are not stranded. We applied a pipeline that generates new transcripts, which achieves our initial aim of increasing the number of isoforms, yet we worry whether biologists will accept the result. I am now also searching for a pipeline that will only generate new combinations of exons. Many thanks for spending time with me!