Question

Understanding Illumina HiSeq Output


8.3 years ago

progression1990 • 0

I have output from an illumina HiSeq run, a series of html, zip, and fastq.gz files which are confusing to me.

'X' takes the values 1 and 2; there are 8 sets of files, as shown below:

N2_gDNA_00X_SX_L00X_R1_001_fastqc.html
N2_gDNA_00X_SX_L00X_R1_001_fastqc.zip
N2_gDNA_00X_SX_L00X_R1_001.fastq.gz

My goal is to downsample this data by 50% using seqtk into one fa file which I can then use in a pipeline.

I am assuming the different sets of files are from different flow cells?

Why is there both a fastqc.zip file and a fastq.gz file?

Do I just concatenate the files from different flow cells together?

sequence • 3.7k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.3 years ago by progression1990 • 0


N2_gDNA_00X_SX_L00X_R1_001_fastqc.html
N2_gDNA_00X_SX_L00X_R1_001_fastqc.zip

are the quality-control reports produced by FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Your raw reads are only in N2_gDNA_00X_SX_L00X_R1_001.fastq.gz

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2018-02-21

While Pierre and h.mon have addressed all your questions, I just wanted to pick up on your assumption that

the different sets of files are from different flow cells

Without knowing the actual samples that you sequenced, we cannot completely extract all the information you may need.

There are various aspects to the naming scheme *00X_SX_L00X_R1; some seem to be more intuitive than others. Let's tackle it from the end:

R1: probably stands for "read 1". There are single-read and paired-end protocols for sequencing; if all your samples end with R1, this would indicate that the data was sequenced in single-read mode, i.e., one read per DNA fragment. The paired-end protocol would produce two reads per fragment, one covering the 3' and one covering the 5' end; both would be stored in separate files (and are typically kept separate, at least in the fastq files).
L00X: probably stands for "lane 1" and "lane 2". Illumina's flow cells have several lanes into which the DNA is transferred, and the sequencing results from each lane are reported in separate fastq files. If all your samples are going to take up more than one lane (there's a limit to how much can be loaded per lane), it is good practice to barcode every sample and distribute all samples across all lanes [1]. This way, the possible technical effect that an individual lane may have will not be confounded with the distinct samples. It also means, however, that every sample's sequencing results will be split into different fastq files (one for each lane). This is why h.mon suggested simply concatenating the files that belong to the same sample into one fastq file, i.e. $ cat N2_gDNA_001_S1_L00*_R1_001.fastq.gz > N2_gDNA_001_S1_R1.fastq.gz.

Notice how I assume that SX stands for sample. It could also just mean "sequencing run" or "sunset" or nothing at all. I also have no idea what 00X stands for. These are details you should be able to deduce if you know how many samples you submitted and whether they were sequenced in one or multiple runs.
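To make the field-by-field reading above concrete, here is a minimal shell sketch that splits such a file name into its parts. It assumes the files follow the common bcl2fastq layout `<sample>_S<n>_L<lane>_R<read>_<chunk>.fastq.gz`, which matches the names in the question but has not been confirmed for this run. The function name `parse_fastq_name` and the output labels (`sample`, `index`, `lane`, `read`, `chunk`) are made up for illustration.

```shell
# Split a FASTQ file name into its underscore-separated fields.
# ASSUMPTION: names look like <sample>_S<n>_L<lane>_R<read>_<chunk>.fastq.gz
# (the usual bcl2fastq layout); anything else is reported as unrecognized.
parse_fastq_name() {
    # Drop any leading directory so only the file name is matched.
    name=$(basename "$1")
    # sed prints a line only when the whole name fits the assumed pattern.
    parsed=$(printf '%s\n' "$name" | sed -n -E \
        's/^(.+)_S([0-9]+)_L([0-9]+)_R([12])_([0-9]+)\.fastq\.gz$/sample=\1 index=\2 lane=\3 read=\4 chunk=\5/p')
    if [ -z "$parsed" ]; then
        echo "unrecognized name: $name" >&2
        return 1
    fi
    printf '%s\n' "$parsed"
}
```

Under that assumption, `parse_fastq_name N2_gDNA_001_S1_L002_R1_001.fastq.gz` would report `N2_gDNA_001` as the sample name, `1` as the S number, `002` as the lane and `1` as the read, which would make the leading 00X part of the sample name rather than a separate field.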

In regard to the fastqc* files, I would highly recommend actually having a look at them. These are basic quality controls that will tell you whether the sequencing was good. The link to the FastQC tool that Pierre provided above will also lead you to decent documentation, and you may also find pages 13-18 of these course materials helpful.

Ram · Answer 2 · 2016-01-05

For several applications, digital normalization is more efficient than plain downsampling.

As Pierre Lindenbaum said, only the .fastq.gz are raw reads.

You may concatenate all files that were automatically split up by Illumina's pipeline:

cat N2_gDNA_00X_SX_L00X_R1_*.fastq.gz > N2_gDNA_00X_SX_L00X_R1.fastq.gz
cat N2_gDNA_00X_SX_L00X_R2_*.fastq.gz > N2_gDNA_00X_SX_L00X_R2.fastq.gz

But do not concatenate R1 and R2 together. Depending on your downstream analyses, you may or may not want to concatenate the same samples from different lanes - e.g., for differential gene expression, you may want to keep them separate and check for batch effects.
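If you do decide to merge lanes, the concatenation can be looped over all samples, and the 50% downsampling from the question can be done on the merged files. The following is only a sketch under assumptions: file names are taken to look like `<sample>_S<n>_L00<lane>_R<1|2>_001.fastq.gz`, every sample is assumed to have a lane-1 file, and the function names `merge_lanes` and `downsample_half` as well as the seed value 100 are illustrative choices, not part of any tool.

```shell
# Merge the per-lane files of every sample found in a directory,
# keeping R1 and R2 in separate outputs.
# ASSUMPTION: names look like <sample>_S<n>_L00<lane>_R<1|2>_001.fastq.gz
# and every sample has an L001 file.
merge_lanes() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir"
    # Each lane-1 file stands in for one sample/read combination.
    for first in "$indir"/*_L001_R[12]_001.fastq.gz; do
        [ -e "$first" ] || continue              # the glob matched nothing
        base=$(basename "$first")
        prefix=${base%_L001_R[12]_001.fastq.gz}  # e.g. N2_gDNA_001_S1
        mate=${base#"${prefix}_L001_"}           # e.g. R1_001.fastq.gz
        mate=${mate%_001.fastq.gz}               # e.g. R1
        # Globs expand in sorted order, so L001 is written before L002.
        # Concatenated gzip files are themselves valid gzip files.
        cat "$indir/$prefix"_L[0-9][0-9][0-9]_"$mate"_001.fastq.gz \
            > "$outdir/${prefix}_${mate}.fastq.gz"
    done
}

# Keep roughly half of the reads with seqtk (must be installed separately).
# For paired-end data, use the same seed for R1 and R2 so the pairs stay in sync.
downsample_half() {
    seqtk sample -s100 "$1" 0.5 | gzip -c > "$2"
}
```

For example, `merge_lanes . merged` followed by `downsample_half merged/N2_gDNA_001_S1_R1.fastq.gz N2_gDNA_001_S1_R1.half.fastq.gz` would give one downsampled file per sample. If the pipeline really needs FASTA rather than FASTQ, `seqtk seq -a` can convert the result afterwards.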