I am learning how to analyse NGS data. I have data for 192 samples, obtained through a targeted sequencing library prep.
I have 192 samples, but technically I have received 2 files for each sample. For example:
Presumably the reason there are 2 files is because paired-end sequencing was performed. I've been reading around on the various steps of NGS analysis but I can't seem to find an answer to the following question:
How do I handle paired-end reads? Do I have to merge them, and if so, when? Before alignment, presumably? Also, do I have to uncompress the fastq.gz files before I do anything? I am very new to NGS, so apologies if these are really basic questions. Thanks.
You don't need to merge the R1/R2 reads. You don't say what kind of data this is, but generally, if you are aligning to a reference, you give both files together to an NGS aligner. Since the two files contain reads from opposite ends of the same fragment, aligning them as a pair provides spatial information (the expected distance and orientation between the mates).
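Since the aligner wants the two files of each sample side by side, it can help to pair them up programmatically when there are 192 samples. The following is only a sketch: the naming convention (`<sample>_R1...` / `<sample>_R2...`, optionally with an Illumina-style `_001` suffix) and the function name `pair_fastq_files` are assumptions, because the actual file names were not shown in the question.

```python
import re
from collections import defaultdict

# Assumed naming convention (not confirmed by the question):
#   <sample>_R1.fastq.gz / <sample>_R2.fastq.gz, optionally with a _001 suffix.
MATE_PATTERN = re.compile(
    r"^(?P<sample>.+)_R(?P<mate>[12])(?:_\d+)?\.f(?:ast)?q(?:\.gz)?$"
)


def pair_fastq_files(filenames):
    """Group FASTQ file names into {sample: (r1_file, r2_file)}."""
    mates = defaultdict(dict)
    for name in filenames:
        match = MATE_PATTERN.match(name)
        if match is None:
            continue  # skip anything that does not look like a paired FASTQ
        mates[match["sample"]][match["mate"]] = name

    pairs = {}
    for sample, found in sorted(mates.items()):
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate file")
        pairs[sample] = (found["1"], found["2"])
    return pairs
```

With 192 samples you would expect 192 entries in the returned dictionary; a `ValueError` points to a sample with only one of its two files.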
Essentially all current NGS tools understand gzipped files. You should not need to decompress them during analysis (note: there may be some exceptions depending on very specific programs).
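The same holds if you want to peek at the data yourself: a gzipped FASTQ can be read directly, without writing an uncompressed copy to disk. A minimal sketch using only Python's standard library, assuming the common four-lines-per-record FASTQ layout (the function name `count_fastq_records` is just illustrative):

```python
import gzip


def count_fastq_records(path):
    """Count reads in a FASTQ file, gzipped or plain.

    Assumes the usual layout of exactly 4 lines per record.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return n_lines // 4
```

For a properly paired data set, the R1 and R2 files of a sample should report the same number of records.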
(An aside: if the reads are longer than half the insert size, they can overlap in the middle.)
Reads will overlap in this case
|------------------------------>100 bp| R1 - 150 bp
|-------------------------------------| Fragment 250 bp
|100 bp<------------------------------| R2 - 150 bp
and will not here
|-------->                            | R1 - 100 bp
|-------------------------------------| Fragment 350 bp
|                           <---------| R2 - 100 bp
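The arithmetic behind the two pictures is simple enough to sketch. Assuming both mates have the same length and are sequenced inward from opposite ends of the fragment, the overlap is twice the read length minus the fragment length; a negative value means there is an unsequenced gap in the middle. The function name `mate_overlap` is only illustrative.

```python
def mate_overlap(read_length, fragment_length):
    """Bases covered by both mates (positive) or left unsequenced (negative).

    Assumes R1 and R2 have the same length and start at opposite ends
    of the fragment. The overlap can never exceed the fragment itself.
    """
    return min(2 * read_length - fragment_length, fragment_length)


# The two cases drawn above:
print(mate_overlap(150, 250))  # 50   -> the mates overlap by 50 bp
print(mate_overlap(100, 350))  # -150 -> 150 bp in the middle is not sequenced
```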
Keep them separate. As Nicolas said, most modern NGS software handles paired-end reads. Once they're aligned, you should have a single SAM/BAM file containing reads from both ends.
Depending on your purpose, you may need to choose different tools. For variant calling using whole-genome sequencing data, I used bwa mem for aligning the reads. A good resource is the Broad's Best Practices Guideline, which covers the alignment step (note which version of the Guideline you're using; I've used the one for GATK 3.0, and they recently updated to 4.0, so I can't comment on the latest one).
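To make the "use the two files together" part concrete for bwa mem: the reference and both FASTQ files go on one command line, and the SAM output is written to standard output. The sketch below only builds the argument list; the file names, the thread count, and the helper name `bwa_mem_command` are placeholders, and it assumes the reference has already been indexed with `bwa index`.

```python
def bwa_mem_command(reference, r1, r2, threads=4):
    """Build the argument list for a paired-end bwa mem run.

    R1 and R2 are passed as two separate files; nothing is merged.
    """
    return ["bwa", "mem", "-t", str(threads), reference, r1, r2]


# Hypothetical file names, for illustration only.
cmd = bwa_mem_command("ref.fa", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz")
print(" ".join(cmd))

# To actually run it (requires bwa on the PATH), something like:
#   import subprocess
#   with open("sample1.sam", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```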
For RNA-Seq, if you have Illumina short reads, you probably want a splice-aware aligner in order to detect cases like a read spanning an exon and part of a retained intron. I like STAR personally; HISAT2 is also popular and a bit more recent.
And gzipped files are usually supported; with STAR you simply specify --readFilesCommand zcat
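Put together, a paired-end STAR call on gzipped input might be assembled as below. This is a sketch rather than a recommended configuration: the index directory, file names, output prefix, and the helper name `star_command` are placeholders, and on systems where `zcat` does not handle `.gz` files you may need `gunzip -c` instead.

```python
def star_command(genome_dir, r1, r2, out_prefix, threads=4):
    """Build the argument list for a paired-end STAR run on gzipped FASTQ."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,         # directory holding a prebuilt STAR index
        "--readFilesIn", r1, r2,           # both mates, as separate files
        "--readFilesCommand", "zcat",      # decompress on the fly
        "--outFileNamePrefix", out_prefix,
    ]


# Hypothetical paths, for illustration only.
cmd = star_command("star_index", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1_")
print(" ".join(cmd))
```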
My RNA-Seq pipeline uses STAR + RSEM for quantification of genes/transcripts.
And if the reads are longer than the fragment then you'll sequence through the fragment into the adapters. This is why many pipelines include an adapter trimming step.
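To illustrate what an adapter trimming step does, here is a deliberately naive sketch that removes an adapter from the 3' end of a read by exact matching. Dedicated trimmers tolerate sequencing errors and use quality scores, which this does not. The adapter shown is the commonly used Illumina TruSeq adapter prefix; whether it applies to your library prep is an assumption you should check, and the function name `trim_adapter` is illustrative.

```python
ILLUMINA_ADAPTER = "AGATCGGAAGAGC"  # assumed adapter; check your library prep


def trim_adapter(read, adapter=ILLUMINA_ADAPTER, min_overlap=3):
    """Remove adapter sequence from the 3' end of a read (exact match only)."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends partway into the adapter.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read  # no adapter found


print(trim_adapter("ACGTACGT" + ILLUMINA_ADAPTER + "TTT"))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC"))                        # ACGTACGT
```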