For an RNA-Seq data set, I performed transcription start site (TSS) usage analysis using Cufflinks. The workflow was as follows: multiple replicates of each genotype (2) were sequenced on an Illumina platform; the reads were aligned using STAR; transcripts were assembled using Cufflinks; the assembled GTF files were merged using cuffmerge (with the reference annotation included). Isoform quantification was performed using cuffquant.
I wanted to analyse differential TSS usage. However, when I look at the differentially expressed TSSs, I see that they include several "novel" TSSs, many of which have a start that differs by only 1 nucleotide from the reference. Is this just a mapping issue, and what can I do to systematically identify and remove the ones that seem to be false-positive novel TSSs?
Thanks
Thanks Devon. Can you think of any automated way to do that? For the novel TSS groups that are adjacent to the gene start I could use a cut-off; but for other TSSs further downstream that are also next to an annotated TSS, I'm not sure how that would be feasible, even with the annotated files.
I'm also wondering how much confidence one can have in the differential expression analysis then: if reads are redistributed to these TSSs that are not actually novel, then the FPKM values assigned to such a TSS and to the adjacent canonical TSS may be miscalculated. Would you agree with that rationale?
bedtools closest could be used to filter the novel TSSs. Regarding how reliable the results are for differential TSS usage I can't really say. To be honest, I've never personally used the differential TSS testing.
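To make the filtering idea concrete, here is a minimal Python sketch of the same nearest-feature logic that bedtools closest would give you (with its -d option to report the distance and -s to require the same strand). It is only an illustration, not something I have run on Cufflinks output: the function names, the tuple layout (id, chrom, position, strand) and the 10 bp default cut-off are all assumptions, and you would first have to extract the TSS coordinates from your merged and reference GTF files yourself.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(annotated):
    """Group annotated TSS positions by (chrom, strand) and sort them.

    `annotated` is an iterable of (chrom, pos, strand) tuples (assumed layout).
    """
    index = defaultdict(list)
    for chrom, pos, strand in annotated:
        index[(chrom, strand)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def distance_to_nearest(index, chrom, pos, strand):
    """Distance in bp to the nearest annotated TSS on the same chrom and strand.

    Returns None when there is no annotated TSS on that chrom/strand.
    """
    positions = index.get((chrom, strand))
    if not positions:
        return None
    i = bisect_left(positions, pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = positions[max(i - 1, 0):i + 1]
    return min(abs(pos - p) for p in candidates)


def split_novel_tss(annotated, novel, max_dist=10):
    """Split novel TSSs into kept IDs and dropped (id, distance) pairs.

    A novel TSS is dropped when it lies within `max_dist` bp (hypothetical
    cut-off) of an annotated TSS on the same strand.
    `novel` is an iterable of (tss_id, chrom, pos, strand) tuples.
    """
    index = build_index(annotated)
    kept, dropped = [], []
    for tss_id, chrom, pos, strand in novel:
        d = distance_to_nearest(index, chrom, pos, strand)
        if d is not None and d <= max_dist:
            dropped.append((tss_id, d))
        else:
            kept.append(tss_id)
    return kept, dropped
```

Because the comparison is against the nearest annotated TSS rather than the gene start, the same cut-off also covers novel TSSs further downstream that sit next to an annotated alternative TSS. What cut-off is sensible is a judgment call; it does not address whether the FPKM values were split between the two TSSs.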