Hello. My colleague and I are trying to run a differential gene analysis to compare protein activity in cutthroat trout under normal and heat stressed condition. We are following in the footsteps of an earlier article that did pretty much the same thing in rainbow trout. While we wait until our data is available, we are trying to duplicate the steps followed by the rainbow trout researchers, who published their data sets as SRA files. Here's what we've done so far.
Acquire Data Files
1) Download the O.mykiss (rainbow trout) complete genome from NCBI in FASTA format.
2) Download the RNA-Seq data files DRR001887 and DRR001888 for the published paper. We have these in FASTQ format.
3) Download the GFF file from NCBI for the genome.
Initially, we tried to do the work in Galaxy 18, but we could never get it to use more than one CPU, so we ran the following commands manually. We thought we would use hisat2 for alignment and DESeq2 for differential genome expression. My guess is that the problem is somewhere near the end.
Phase 1 - use hisat2 to align the published SRA data against the reference O.mykiss genome. The goal is to create a SAM file, then convert to a BAM file. I also use samtools create a BAI file to aid visualization in IGV (which seems to work fine) and sort the BAM file. I use the the optionto run 4 threads.
1) sudo hisat2-build -p 4 dataset_6.dat newgome
2) sudo ln -f -s dataset_5.dat input_f.fast.gz
3) sudo hisat2 -p 4 -x 'newgenome' -U input_f.fastq.gz
4) sudo samtools view -bS -@ 4 first.sam > first.bam
5) sudo samtools index first.bam first.bai -@ 4
6) sudo samtools sort first.bam first_sorted.bam -@ 4
At this point, I have a sorted BAM file. My plan is to use this BAM file to generate gene expression counts with featureCounts. That is supposed to create a counts.txt file that can be run through DESeq2.
7) Install R 3.5 and related libraries, following directions on Bioconductor. After correcting some issues related to Ubuntu 18.04 LTS and incorrect repository, this step seems to go without incident.
At this point, I realize that I need the GFF genome annotation for O.mykiss. I download this from NCBI and gunzip it. 8) wget ... 9) gunzip -c GFC*.gz > O.mykiss.gff
I download and install featureCounts with sudo apt install . 10) featureCounts -T 4 -F 'GFF' -a O.mykiss.gff -o counts.txt first_sorted.bam
At this point, the command seems to run for a few minutes and does produce the counts.txt and counts.txt.summary files. When I run the featureCounts command from the console, I get a warning file about my GFF file not being in the correct format. It appears to run anyway. The output shows that it processed 860000 features and 48.3% of the assigned reads. (See attached screen shot.)
My concern is that when I look at the counts.txt file (29.5 MB), there are thousands of rows of data, but I don't see anything resembling gene names or counts. The files from other examples do not resemble mine. I feel like my counts.txt is coming back with nonsense because my GFF file is somehow incompatible or not in sync with my alignment BAM file. Since I am pulling data from multiple researchers, perhaps there is some disconnect.
I tried converting the GFF to GTF file with gffread, which worked, but did not yield anything different.
I also try a few command line switches in featureCounts for -g and -T, but this seems to yield nothing different.
Here are links to my counts.txt and GFF. Again, the GFF is not my own but from NCBI's page on the rainbow trout genome. The DRR SRA files are also from third party researchers.
https://crypticresponse.s3.amazonaws.com/static/rnaseq/counts.txt (29 MB)
https://crypticresponse.s3.amazonaws.com/static/rnaseq/counts.txt.summary
O.mykiss Genome in FASTA
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/omykiss_genome.fasta
O.mykiss GFF file
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/O.mykiss.gff
DRR SRA file #1
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/DRR001887.sra
DRR SRA file #2
- https://crypticresponse.s3.amazonaws.com/static/rnaseq/DRR001888.sra
If any kind soul has any feedback on what I am doing wrong, I would greatly appreciate it. I am new to bioinformatics, just trying to get up to speed. I am happy to reconfigure my Linux VM or download other alternative tools, as needed. I have a VMware VM running with plenty of RAM and storage.
I'm happy to sweeten the deal with Amazon AWS credits or cash if anyone would show me the best practices for how to complete this exercise. I would even sponsor someone to visit us in Denver for a weekend if they felt like doing a little hand holding on this project. We think hisat2 / DESeq2 is the best approach based on reading through various exercises and tutorials, but we don't know enough to make an informed determination. We need to figure this out before we send off our pure RNA for sequencing this summer.
Thanks for all your help. George
Hi, I think your counts look ok. The ids used are transcripts as expected. Only settings I would add is strand options in featureCounts. Counts are in the last column with the file name as header. Cheers
Another thing, why are you running your commands via sudo? That is not recommended.