Hi all,
I have been receiving some confusing output with an RNA-seq dataset that I am trying to ultimately determine differential expression and fpkm values for.
Background: Library prep was stranded TruSeq (dUTP method) of mouse metatranscriptome and was given assembled contigs of transcriptome (Trinity) across 18 samples as well as RNA seq fastq files, performed on 4 lanes and two mate pair files, for each of these samples.
I ran HiSAT2 three times on one of the samples' pairs of reads with the following options (-fr, -rf and -ff) to make sure I am aligning using the correct strandedness (threw in -ff for funsies) but all gave the same output, with really only half maybe of the pairs aligning:
hisat2 -fr -x ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fa
stq -S fr.sam
RESULTS:
6185153 reads; of these:
6185153 (100.00%) were paired; of these:
5629996 (91.02%) aligned concordantly 0 times
475008 (7.68%) aligned concordantly exactly 1 time
80149 (1.30%) aligned concordantly >1 times
----
5629996 pairs aligned concordantly 0 times; of these:
1575716 (27.99%) aligned discordantly 1 time
----
4054280 pairs aligned 0 times concordantly or discordantly; of these:
8108560 mates make up the pairs; of these:
3511037 (43.30%) aligned 0 times
3273195 (40.37%) aligned exactly 1 time
1324328 (16.33%) aligned >1 times
71.62% overall alignment rate
hisat2 -rf -x ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fastq -S rf.sam
6185153 reads; of these:
6185153 (100.00%) were paired; of these:
5629996 (91.02%) aligned concordantly 0 times
475008 (7.68%) aligned concordantly exactly 1 time
80149 (1.30%) aligned concordantly >1 times
----
5629996 pairs aligned concordantly 0 times; of these:
1575716 (27.99%) aligned discordantly 1 time
----
4054280 pairs aligned 0 times concordantly or discordantly; of these:
8108560 mates make up the pairs; of these:
3511037 (43.30%) aligned 0 times
3273195 (40.37%) aligned exactly 1 time
1324328 (16.33%) aligned >1 times
71.62% overall alignment rate
hisat2 -ff -x ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fastq -S ff.sam
RESULTS:
6185153 reads; of these:
6185153 (100.00%) were paired; of these:
5629996 (91.02%) aligned concordantly 0 times
475008 (7.68%) aligned concordantly exactly 1 time
80149 (1.30%) aligned concordantly >1 times
----
5629996 pairs aligned concordantly 0 times; of these:
1575716 (27.99%) aligned discordantly 1 time
----
4054280 pairs aligned 0 times concordantly or discordantly; of these:
8108560 mates make up the pairs; of these:
3511037 (43.30%) aligned 0 times
3273195 (40.37%) aligned exactly 1 time
1324328 (16.33%) aligned >1 times
71.62% overall alignment rate
To further complicate things, I ran infer_experiment.py
from RseQC to double check that the data is indeed stranded and all my results came back as suggesting unstranded paired reads:
$ infer_experiment.py -i ff.sam -r ../Sample_120507/fragGeneScan.gff.bed
Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0005
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764
$ infer_experiment.py -i rf.sam -r ../Sample_120507/fragGeneScan.gff.bed
Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0005
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764
$ infer_experiment.py -i fr.sam -r ../Sample_120507/fragGeneScan.gff.bed
Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0005
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764
But the sequencing core assured me that they were using a stranded protocol. I am thinking maybe the issues of low alignment as well as the [un]stranded issue may be related and due to something that I am missing here, or maybe there is not an issue here and the results are fine.
Any thoughts/suggestions would be much appreciated!
Interesting case. Double check with the sequencing core the library type they have used. Maybe there is a confusion somewhere... e.g
TruSeq RNA Sample Prep kit
is not a stranded library. You could give a try to GUESSmyLT to look more into details of the reads orientation.