Question

Whether to use all aligned reads (QC-passed) or only properly-paired reads from rnaseq data?

0

Entering edit mode

6.5 years ago

bioinfo8 ▴ 230

Hi,

I have multiple paired-end files from RNA-Seq which I have aligned using hisat2 and then converted to bam via samtools. Now I am confused whether I should use all the reads (QC-passed) or only properly-paired reads for further analysis.

Please advise.

Thanks.

RNA-Seq properly-paired samtools ngs hisat2 • 2.7k views

ADD COMMENT • link 6.5 years ago by bioinfo8 ▴ 230

1

Entering edit mode

For differential expression, there's nothing wrong with using all the reads. After all, one mate still mapped and provides information to help quantify gene expression. Paired-reads are great for inferring splicing events, but I'm not sure they help quantify expression significantly better than using single-end reads. You may want to consider whether some of the downstream tools you will use can work with BAM files which have mixed paired and unpaired reads.

ADD REPLY • link 6.5 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

@James Ashmore: Ok, thanks.I am planning to use Stringtie and Ballgown

ADD REPLY • link 6.5 years ago by bioinfo8 ▴ 230

score 0 · Answer 1 · 2017-10-14

0

Entering edit mode

6.5 years ago

Kevin Blighe 87k

I think that your question can be answered when you are doing the alignment with hisat2, specifically by activating or leaving out the following 2 parameters:

--no-mixed

By default, when hisat2 cannot find a concordant or discordant alignment for a pair, it then tries to find alignments for the individual mates. This option disables that behavior.

--no-discordant

By default, hisat2 looks for discordant alignments if it cannot find any concordant alignments. A discordant alignment is an alignment where both mates align uniquely, but that does not satisfy the paired-end constraints (--fr/--rf/--ff, -I, -X). This option disables that behavior. [source: https://ccb.jhu.edu/software/hisat2/manual.shtml]

If you just wanted hisat2 to look at concordant mate-pairs, then include both of these parameters when you run the command (these were also available in TopHat2).

Looking at discordant alignments is of much utility in the area of fusion-gene analysis, and indeed other types of analyses looking at genomic re-arrangements.

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

I am already done with hisat2 alignment for my all samples (so many) and the ultimate aim of RNA-seq is to study the differential expression of genes between different samples. Therefore, considering that I asked the question.

ADD REPLY • link 6.5 years ago by bioinfo8 ▴ 230

0

Entering edit mode

I think that you should reconsider performing the hisat2 step again, i.e., if you were not sure about this key aspect of the process regarding read pairing. The hisat2 output will directly and indirectly affect all downstream analyses, including your differential express results.

However, if you are pressed for time, then you can ensure that only properly paired reads are retained by using: samtools view -f 2 -bS -o MyOutput.bam MyInput.bam (take a look at the SAM format flags here: http://broadinstitute.github.io/picard/explain-flags.html).

Additionlly, if you are just interested in gene expression over known transcripts (coding and non-coding), then I would encourage you to use Kallisto and count read abundances over the GENCODE reference transcriptome FASTA, which will save you weeks and possibly months depending on your sample n.

Good luck!

Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks. I have observed that using properly-paired reads will lead to more confidence in the data (for e.g. in case of exome analysis), but I want to know in case of RNA-Seq.

I would not consider doing hisat2 alignment again as it is so much time-consuming.

I used Kallisto, Sleuth approach, but I am not satisfied with it.

samtools flagstat for one of the bam after hisat2 mapping:

61491340 + 0 in total (QC-passed reads + QC-failed reads)
5392310 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
55668562 + 0 mapped (90.53% : N/A)
56099030 + 0 paired in sequencing
28049515 + 0 read1
28049515 + 0 read2
48144848 + 0 properly paired (85.82% : N/A)
48595888 + 0 with itself and mate mapped
1680364 + 0 singletons (3.00% : N/A)
79988 + 0 with mate mapped to a different chr
63343 + 0 with mate mapped to a different chr (mapQ>=5)

So, I feel that I should go for extraction of properly-paired.

ADD REPLY • link 6.5 years ago by bioinfo8 ▴ 230

1

Entering edit mode

Sure, I can see that you have large samples (61.5 million reads for this one). It generally looks good with 85.8% alignment but, due to the large size, the 3% singletons comprise 1.7 million reads!

Kevin

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k