I am in the process of analyzing and visualizing my RNA-seq data from a time course experiment (5 time points, no treatments).
I first mapped and aligned reads to my organism's reference genome using Tophat, then filtered the output bam files to maintain high quality/likelihood of correct alignment and to remove multi-mapped reads, then I used Cufflinks to assemble transcripts, and finally used HTseq-count to generate count data for DE analysis with an edgeR/limma based package.
I also want to perform a type of targeted heatmap/clustering analysis to look at larger lists of GOIs (that are not necessarily significantly DE) to determine which of these genes appear to change similarly, change the most over time, or have the highest expression levels vs. which ones do not appear to be induced or very highly expressed during this time course. I plan to use this information to weed out genes that are not likely involved in the process I’m studying, and to keep those demonstrating certain patterns and levels of temporal change in mind as candidates for later analysis. For example, some of the genes in my lists share functional names and have high sequence similarity and I would like to determine which specific genes within these groups of similarly annotated genes are most responsive throughout this time course.
Is it valid to use data derived from HTseq-count where multi-mapped reads have been discarded, even though this is not a DEG-based analysis? From my understanding, using these filtered data would help to avoid false positives, but I also realize that HTseq-count generates data for DE comparison between treatments (or in my case, time points) rather than for comparison between 2 or more genes. I want to be sure that I’m not drawing poor conclusions by using an inappropriately filtered version of my data that is useful for DEG analysis but not for other types of analyses.
Thanks in advance for any insights.