Question: NGS analysis: how to handle paired-end reads

I am learning how to analyse NGS data. I have data for 192 samples, obtained through a targeted sequencing library prep.

I have 192 samples, but technically I have received 2 files for each sample. For example:

  • sample1_TTGCCTT_L008_R1_001.fastq.gz
  • sample1_TTGCCTT_L008_R2_001.fastq.gz

Presumably there are 2 files because paired-end sequencing was performed. I've been reading about the various steps of NGS analysis, but I can't find an answer to the following question:

How do you handle paired-end reads? Do you have to merge them, and if so, when? Before alignment, presumably? Also, do I have to uncompress the fastq.gz files before I do anything? I am very new to NGS, so apologies if these are really basic questions. Thanks.

19 months ago by m93 • 150 • updated 19 months ago by manuel.belmadani • 830

You don't need to merge the R1/R2 reads. You don't say what kind of data this is, but generally, if you are aligning to a reference, you give both files together to an NGS aligner. Since the two files contain reads from opposite ends of the same fragment, aligning them as a pair provides spatial information (the distance and orientation between the two mates).

Most current NGS tools read gzipped files directly. You should not need to decompress them during analysis (note: there may be exceptions with very specific programs).
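To illustrate that nothing has to be unpacked to disk, here is a minimal Python sketch that streams a gzipped FASTQ file and counts its records. The function name `count_fastq_records` is made up for this example, and it assumes the usual 4-lines-per-record FASTQ layout.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a gzipped FASTQ file without unpacking it to disk.

    Assumes the standard 4-line FASTQ record layout (header, sequence,
    '+' separator, qualities); multi-line FASTQ is not handled.
    """
    n_lines = 0
    # gzip.open in text mode decompresses on the fly while iterating.
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4
```

For a paired-end run, the R1 and R2 files of a sample should give the same count; a mismatch usually points to a truncated or corrupted file.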

(An aside: if the reads are longer than half the insert size, they can overlap in the middle.)

Reads will overlap in this case

|------------------------------>100 bp|    R1 - 150 bp
|-------------------------------------|    Fragment 250 bp
|100 bp<------------------------------|    R2 - 150 bp

and will not here

|-------->                            |    R1 - 100 bp
|-------------------------------------|    Fragment 350 bp
|                           <---------|    R2 - 100 bp
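The arithmetic behind the two diagrams can be sketched in a few lines of Python. The function name `mate_overlap` is invented for this example, and it assumes both mates have the same length and that adapters have already been removed.

```python
def mate_overlap(read_length, fragment_length):
    """Return how many bases R1 and R2 share on one fragment.

    R1 covers the first `read_length` bases of the fragment and R2 the
    last `read_length` bases, so together they cover 2 * read_length.
    Anything beyond the fragment length is sequenced twice.
    """
    # A read cannot contribute more fragment bases than the fragment has.
    covered = min(read_length, fragment_length)
    return max(0, 2 * covered - fragment_length)


print(mate_overlap(150, 250))  # 50: the mates overlap
print(mate_overlap(100, 350))  # 0: a 150 bp gap stays unsequenced
```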
19 months ago by genomax 68k

And if the reads are longer than the fragment then you'll sequence through the fragment into the adapters. This is why many pipelines include an adapter trimming step.
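As a rough idea of what an adapter trimming step does, here is a minimal Python sketch. The adapter sequence below is an assumption (the start of the common Illumina TruSeq adapter); the right sequence depends on your library prep kit. In practice you would use a dedicated trimmer such as cutadapt, Trimmomatic or fastp rather than this exact-match toy.

```python
# Assumption: the first bases of the common Illumina TruSeq adapter.
# Check your library prep kit's documentation for the real sequence.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(sequence, quality, adapter=ADAPTER):
    """Cut a read (and its quality string) at the first exact adapter match.

    Only full, exact matches are handled here; dedicated trimmers also
    catch mismatches and partial adapters at the very end of the read.
    """
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality  # no read-through detected
    return sequence[:position], quality[:position]
```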

19 months ago ♦ 2.0k

Keep them separate. As Nicolas said, most modern NGS software handles paired-end reads. Once they're aligned, you should have a single SAM/BAM file containing the reads from both ends.
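Inside that single SAM/BAM file, the FLAG column records which mate each alignment came from. The sketch below decodes the lower eight FLAG bits as defined in the SAM format specification; the label strings and the helper name `decode_flag` are my own choices for illustration.

```python
# Bit values from the SAM format specification (FLAG field).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",   # read came from the R1 file
    0x80: "second_in_pair",  # read came from the R2 file
}


def decode_flag(flag):
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))   # R1 of a properly paired fragment, mate on reverse strand
print(decode_flag(147))  # R2 of a properly paired fragment, on the reverse strand
```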

Depending on your purpose, you may need different tools. For variant calling from whole-genome sequencing data, I used bwa mem to align the reads. A good resource is the Broad's Best Practices Guideline, which covers the alignment step (note which version of the Guideline you're using; I've used the one for GATK 3.0, and they recently updated to 4.0, so I can't comment on the latest one).

For RNA-Seq, if you have Illumina short reads, you probably want a splice-aware aligner in order to detect cases like a read spanning an exon and part of a retained intron. I like STAR personally; HISAT2 is also popular and a bit more recent.

And gzipped files are often supported; with STAR you simply specify --readFilesCommand zcat
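For context, here is a sketch of where that option sits in a full STAR call for one paired-end sample, written as a Python helper that only builds the argument list. The helper name `star_command`, the default thread count and the choice of sorted BAM output are assumptions for illustration, not part of my actual pipeline.

```python
def star_command(genome_dir, r1, r2, prefix, threads=4):
    """Build (not run) a STAR command for one gzipped paired-end sample."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,      # index directory built with STAR beforehand
        "--readFilesIn", r1, r2,        # R1 first, then R2
        "--readFilesCommand", "zcat",   # decompress .gz input on the fly
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
```

The list can be passed to `subprocess.run` once the paths point to real files.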

My RNA-Seq pipeline uses STAR + RSEM for quantification of genes/transcripts.

19 months ago by manuel.belmadani • 830

Most of the modern tools for NGS (e.g. aligners) handle paired-end fastq.gz files. Just give them as input.

For example, with bwa mem:

bwa mem reference sample1_TTGCCTT_L008_R1_001.fastq.gz sample1_TTGCCTT_L008_R2_001.fastq.gz > alignment.sam
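With 192 samples you will not want to type that by hand. The following is a minimal Python sketch that pairs R1 and R2 files by name and builds one bwa mem command per sample. It assumes every file follows the naming pattern shown in the question; `find_pairs` and `bwa_mem_command` are hypothetical helper names, and `reference` stands for your bwa index prefix.

```python
import re
from pathlib import Path

# Assumption: file names follow the pattern shown in the question,
# e.g. sample1_TTGCCTT_L008_R1_001.fastq.gz
R1_PATTERN = re.compile(r"^(?P<stem>.+)_R1_(?P<chunk>\d+)\.fastq\.gz$")


def find_pairs(fastq_dir):
    """Yield (sample_stem, r1_path, r2_path) for every R1 file with a mate."""
    for r1 in sorted(Path(fastq_dir).glob("*_R1_*.fastq.gz")):
        match = R1_PATTERN.match(r1.name)
        if match is None:
            continue
        r2 = r1.with_name(f"{match['stem']}_R2_{match['chunk']}.fastq.gz")
        if r2.exists():  # skip R1 files whose mate file is missing
            yield match["stem"], r1, r2


def bwa_mem_command(reference, r1, r2):
    """Build (not run) a bwa mem command; `reference` is the bwa index prefix."""
    return ["bwa", "mem", str(reference), str(r1), str(r2)]
```

Since bwa mem writes SAM to standard output, each command would be run with its output redirected to a per-sample file, just like the `> alignment.sam` redirect above.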
19 months ago by Nicolas Rosewick 7.7k
