hisat2 parameters for paired end and single end reads from same samples
1
0
Entering edit mode
6.6 years ago
bioinfo17 ▴ 30

Hi,

I have paired end R1.fastq, R2.fastq and singletons.fastq files for the same samples. What parameters should I use for aligning reads against a genome of interest?

1) hisat2 -1 R1.fastq, -2 R2.fastq, -U singletons.fastq

or just treat all the files as unpaired and use

2) hisat2 -U R1.fastq, R2.fastq, singletons.fastq

Tophat had the options of using paired end and single reads together but unsure how hisat2 would work taking into account both paired end and single reads from the same samples. I will be using the bam files generated from hisat2 for downstream differential expression analysis using stringtie.

Should one ignore the singleton reads and just use the paired-end reads (use only -1, -2 and not -U) or Should treat all reads as unpaired and use -U R1.fastq, R2.fastq, singletons.fastq

How would stringtie calculate the counts of transcripts generated from hisat2 using different parameters as mentioned above?

(I had a look at the counts tables generated by 1 and 2 and I get completely different results. For example using method 1, I get a high number of counts of certain transcripts in sampleX and using method 2, I get no counts of the same transcripts in the same sample and sometimes the resulting counts are vice-versa in different samples.)

I found a similar post here but couldn't find a definitive conclusion: https://github.com/feltus/OSG-GEM/issues/10

Any advice will be appreciated, thanks.

hisat2 alignment RNA-Seq stringtie hisat • 7.9k views
ADD COMMENT
3
Entering edit mode
6.6 years ago

What parameters should I use

Parameters are a very delicate matter. You shouldn't ask other people for "parameters", rather you should state what you want to achieve. If the reads come from the same organism of the genome you're mapping them against (I assume this is your case) then the default parameters are close to the best ones.

just treat all the files as unpaired

No, never miss the chance to map reads as paired if they are. The insert size plays a role in this. Perhaps have an estimation of the insert size by mapping a subset of reads and then extract the positive values in the TLEN field of the output SAM file. Make a plot of the numbers you extracted and see where your peak is. Then you can set -I and -X (min and max accepted insert size), defining a range of, i don't know, 200 bp. The program will try to map pairs at that distance first, even if they contain mismatches, and that makes sense because that is their actual distance. If you treat them as unpaired, it will map each read separately and will find a lot of misalignments (hence your weird expression results).

Should one ignore the singleton reads

Nope, that's biological information. It makes your analysis harder but having a bam file with singletons and paired end reads inside is more biologically informative than having only the paired (remember singletons come out of quality trimming, not because their mate is unmapped).

How would stringtie calculate the counts of transcripts

Stringtie counts the number of reads that cover each position and then (if I'm not wrong) normalizes this number according to the number of isoforms that the gene has, redistributing counts among different isoforms depending on the evidence that they have. But for this one go through their manual, it's clearly stated.

ADD COMMENT
0
Entering edit mode

Estimation of the insert size is a great idea. I will follow that.

Stringtie counts the number of reads that cover each position

I have read the manual of stringtie but couldn't find an answer if it assembles transcripts based on both paired end and singletons reads and the counts generated will also include singletons. In short, does stringtie takes singletons into consideration for transcript assembly and estimation of counts?

Many thanks for your reply!

ADD REPLY
0
Entering edit mode

I am quite sure that it uses every resource available, it would be stupid otherwise. :) Regarding what I said on the insert size estimation, on a second thought there is some clarification to be made:

  • If your data are DNA reads, then estimating the insert size and limiting it to a range that is compatible with your library preparation makes sense
  • If your data are RNA reads mapped on a genome, I'm not sure it makes sense because you obtained them from spliced transcripts but you map them on a genome therefore having the spliced mode turned on (i.e. the mapping distance increases) -> it makes more sense to limit the intron size
  • If your data are RNA reads mapped on a transcriptome, it might make sense to estimate it and limit it again, but here I am not completely sure (there is a topic about this I opened here on BioStars, see Does it make sense to limit insert size when mapping on transcriptome? )
ADD REPLY

Login before adding your answer.

Traffic: 1968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6