Question

NGS analysis: how to handle paired-end reads

5

6.0 years ago

m98 ▴ 420

I am learning how to analyse NGS data. I have data for 192 samples, which were obtained through a targeted sequencing library prep.

I have 192 samples, but technically I have received 2 files for each sample. For example:

sample1_TTGCCTT_L008_R1_001.fastq.gz
sample1_TTGCCTT_L008_R2_001.fastq.gz

Presumably the reason there are 2 files is because paired-end sequencing was performed. I've been reading around on the various steps of NGS analysis but I can't seem to find an answer to the following question:

How should paired-end reads be handled? Do you have to merge them and, if so, when? Before alignment, presumably? Also, do I have to uncompress the fastq.gz files before I do anything? I am very new to NGS, so apologies if these are really basic questions. Thanks.

ngs paired-end reads analysis pipeline • 8.9k views

updated 6.0 years ago by manuel.belmadani ★ 1.3k • written 6.0 years ago by m98 ▴ 420

3

6.0 years ago

manuel.belmadani ★ 1.3k

Keep them separate. As Nicolas said, most modern NGS software handles paired-end reads. Once they're aligned, you should have a single SAM/BAM file containing reads from both ends.

Depending on your purpose, you may need to choose different tools. For variant calling using whole-genome-sequencing data, I used bwa mem for aligning the reads. A good resource would be the Broad's Best Practices Guideline, which covers the alignment step (note which version of the Guideline you're using; I've used the one for GATK 3.0, and they recently updated to 4.0, so I can't comment on the latest one).

For RNA-Seq, if you have Illumina short reads, you probably want a splice-aware aligner in order to detect cases like a read spanning an exon and part of a retained intron. I like STAR personally; HISAT2 is also popular and a bit more recent.

Gzipped files are often supported; with STAR you simply specify --readFilesCommand zcat.

My RNA-Seq pipeline uses STAR + RSEM for quantification of genes/transcripts.

2

6.0 years ago

Nicolas Rosewick 10k

Most of the modern tools for NGS (e.g. aligners) handle paired-end fastq.gz files. Just give them as input.

For example, with bwa mem:

bwa mem reference sample1_TTGCCTT_L008_R1_001.fastq.gz sample1_TTGCCTT_L008_R2_001.fastq.gz > alignment.sam
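
With 192 samples, the same command can be generated per pair rather than typed by hand. The following is only a sketch: it assumes every file follows the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming shown in the question, that all files sit in the current directory, and that `reference` is a placeholder for your bwa index prefix. The helper name `mate_of` is made up for illustration. It prints the commands as a dry run instead of executing them.

```sh
#!/bin/sh
# Hypothetical helper: derive the R2 filename from an R1 filename.
# Assumes the "_R1_<digits>.fastq.gz" suffix seen in the question.
mate_of() {
    printf '%s\n' "$1" | sed 's/_R1_\([0-9]*\.fastq\.gz\)$/_R2_\1/'
}

# Dry run: print one bwa mem command per sample.
# "reference" is a placeholder for the index prefix.
for r1 in *_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue    # no R1 files here: skip the unexpanded glob
    r2=$(mate_of "$r1")
    [ -e "$r2" ] || { echo "missing mate for $r1" >&2; continue; }
    sample=${r1%%_R1_*}         # e.g. sample1_TTGCCTT_L008
    echo "bwa mem reference $r1 $r2 > $sample.sam"
done
```

Once the printed commands look right, they can be piped to `sh` or the `echo` can be dropped.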

score 5 · Accepted Answer · 2018-04-27

You don't need to merge the R1/R2 reads. You don't say what kind of data this is, but generally, if you are aligning to a reference, you would give the two files together to an NGS aligner. Since R1 and R2 contain reads from opposite ends of the same fragment, their alignment to a reference provides spatial information (the distance and orientation between the two mates).

All extant NGS tools should understand gzipped files. You should not need to decompress them during analysis (note: there may be some exceptions depending on very specific programs).
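
As a quick illustration that the files can be inspected without being unpacked on disk, here is a small sketch that streams a gzipped FASTQ and counts its reads. It assumes the standard four-lines-per-record FASTQ layout, and the function name `count_reads` is made up for this example. For a properly paired run, R1 and R2 should report the same number.

```sh
#!/bin/sh
# Hypothetical helper: count reads in a gzipped FASTQ by streaming it.
# Assumes 4 lines per record; no decompressed copy is written.
count_reads() {
    lines=$(gzip -cd "$1" | wc -l)
    echo $((lines / 4))
}

r1=sample1_TTGCCTT_L008_R1_001.fastq.gz
r2=sample1_TTGCCTT_L008_R2_001.fastq.gz
if [ -e "$r1" ] && [ -e "$r2" ]; then
    n1=$(count_reads "$r1")
    n2=$(count_reads "$r2")
    if [ "$n1" -eq "$n2" ]; then
        echo "R1 and R2 both contain $n1 reads"
    else
        echo "read counts differ: R1=$n1 R2=$n2" >&2
    fi
fi
```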

(An aside: if the reads are longer than half the size of the insert, then they can overlap in the middle.)

Reads will overlap in this case

|------------------------------>100 bp|    R1 - 150 bp
|-------------------------------------|    Fragment 250 bp
|100 bp<------------------------------|    R2 - 150 bp

and will not here

|-------->                            |    R1 - 100 bp
|-------------------------------------|    Fragment 350 bp
|                           <---------|    R2 - 100 bp
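
The arithmetic behind the two diagrams can be sketched as follows, assuming the fragment length excludes adapters and both reads are sequenced at full length. The expected overlap is R1 + R2 minus the fragment length, floored at zero. For the first diagram that is 150 + 150 - 250 = 50 bp; for the second, 100 + 100 - 350 is negative, so the mates do not overlap. The function name `pair_overlap` is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: expected overlap in bp between two mates.
# Arguments: R1 length, R2 length, fragment length.
pair_overlap() {
    ov=$(( $1 + $2 - $3 ))
    [ "$ov" -gt 0 ] || ov=0     # negative means a gap, not an overlap
    echo "$ov"
}

pair_overlap 150 150 250    # first diagram: prints 50
pair_overlap 100 100 350    # second diagram: prints 0
```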