Question

HiSAT2 RNAseq alignment low + stranded data predicted to be unstranded

1

Entering edit mode

5.1 years ago

causland • 0

Hi all,

I have been receiving some confusing output with an RNA-seq dataset that I am trying to ultimately determine differential expression and fpkm values for.

Background: Library prep was stranded TruSeq (dUTP method) of mouse metatranscriptome and was given assembled contigs of transcriptome (Trinity) across 18 samples as well as RNA seq fastq files, performed on 4 lanes and two mate pair files, for each of these samples.

I ran HiSAT2 three times on one of the samples' pairs of reads with the following options (-fr, -rf and -ff) to make sure I am aligning using the correct strandedness (threw in -ff for funsies) but all gave the same output, with really only half maybe of the pairs aligning:

hisat2 -fr -x  ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fa
stq -S fr.sam

RESULTS:
6185153 reads; of these:
  6185153 (100.00%) were paired; of these:
    5629996 (91.02%) aligned concordantly 0 times
    475008 (7.68%) aligned concordantly exactly 1 time
    80149 (1.30%) aligned concordantly >1 times
    ----
    5629996 pairs aligned concordantly 0 times; of these:
      1575716 (27.99%) aligned discordantly 1 time
    ----
    4054280 pairs aligned 0 times concordantly or discordantly; of these:
      8108560 mates make up the pairs; of these:
        3511037 (43.30%) aligned 0 times
        3273195 (40.37%) aligned exactly 1 time
        1324328 (16.33%) aligned >1 times
71.62% overall alignment rate


hisat2 -rf -x  ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fastq -S rf.sam

6185153 reads; of these:
  6185153 (100.00%) were paired; of these:
    5629996 (91.02%) aligned concordantly 0 times
    475008 (7.68%) aligned concordantly exactly 1 time
    80149 (1.30%) aligned concordantly >1 times
    ----
    5629996 pairs aligned concordantly 0 times; of these:
      1575716 (27.99%) aligned discordantly 1 time
    ----
    4054280 pairs aligned 0 times concordantly or discordantly; of these:
      8108560 mates make up the pairs; of these:
        3511037 (43.30%) aligned 0 times
        3273195 (40.37%) aligned exactly 1 time
        1324328 (16.33%) aligned >1 times
71.62% overall alignment rate


hisat2 -ff -x  ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fastq -S ff.sam

RESULTS:
6185153 reads; of these:
  6185153 (100.00%) were paired; of these:
    5629996 (91.02%) aligned concordantly 0 times
    475008 (7.68%) aligned concordantly exactly 1 time
    80149 (1.30%) aligned concordantly >1 times
    ----
    5629996 pairs aligned concordantly 0 times; of these:
      1575716 (27.99%) aligned discordantly 1 time
    ----
    4054280 pairs aligned 0 times concordantly or discordantly; of these:
      8108560 mates make up the pairs; of these:
        3511037 (43.30%) aligned 0 times
        3273195 (40.37%) aligned exactly 1 time
        1324328 (16.33%) aligned >1 times
71.62% overall alignment rate

To further complicate things, I ran infer_experiment.py from RseQC to double check that the data is indeed stranded and all my results came back as suggesting unstranded paired reads:

$ infer_experiment.py -i ff.sam -r ../Sample_120507/fragGeneScan.gff.bed
Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done

Loading SAM/BAM file ...  Total 200000 usable reads were sampled


This is PairEnd Data
Fraction of reads failed to determine: 0.0005
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764


$ infer_experiment.py -i rf.sam -r ../Sample_120507/fragGeneScan.gff.bed
Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled


This is PairEnd Data
Fraction of reads failed to determine: 0.0005
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764


$ infer_experiment.py -i fr.sam -r ../Sample_120507/fragGeneScan.gff.bed
Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled


This is PairEnd Data
Fraction of reads failed to determine: 0.0005
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764

But the sequencing core assured me that they were using a stranded protocol. I am thinking maybe the issues of low alignment as well as the [un]stranded issue may be related and due to something that I am missing here, or maybe there is not an issue here and the results are fine.

Any thoughts/suggestions would be much appreciated!

RNA-Seq • 1.5k views

ADD COMMENT • link 5.1 years ago by causland • 0

0

Entering edit mode

Interesting case. Double check with the sequencing core the library type they have used. Maybe there is a confusion somewhere... e.g TruSeq RNA Sample Prep kit is not a stranded library. You could give a try to GUESSmyLT to look more into details of the reads orientation.

ADD REPLY • link 5.1 years ago by Juke34 8.6k