Understanding Illumina HiSeq Output
8.3 years ago

I have output from an Illumina HiSeq run: a series of html, zip, and fastq.gz files that I find confusing.

'X' takes the values 1 and 2, so there are 8 sets of files, as shown below:

N2_gDNA_00X_SX_L00X_R1_001_fastqc.html
N2_gDNA_00X_SX_L00X_R1_001_fastqc.zip
N2_gDNA_00X_SX_L00X_R1_001.fastq.gz

My goal is to downsample this data by 50% using seqtk into one fa file which I can then use in a pipeline.

I am assuming the different sets of files are from different flow cells?

Why is there both a fastqc.zip file and a fastq.gz file?

Do I just concatenate the files from different flow cells together?

N2_gDNA_00X_SX_L00X_R1_001_fastqc.html
N2_gDNA_00X_SX_L00X_R1_001_fastqc.zip

are the quality control reports produced by FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Your raw reads are only in N2_gDNA_00X_SX_L00X_R1_001.fastq.gz

6.2 years ago

While Pierre and h.mon have addressed all your questions, I just wanted to pick up on your assumption that

the different sets of files are from different flow cells

Without knowing the actual samples that you sequenced, we cannot extract all the information you may need.

There are several parts to the naming scheme *00X_SX_L00X_R1; some are more intuitive than others. Let's tackle it from the end:

  • R1: probably stands for "read 1". There are single-read and paired-end protocols for sequencing; if all your file names end with R1, this would indicate that the data was sequenced in single-read mode, i.e., one read per DNA fragment. The paired-end protocol would produce two reads per fragment, one covering the 3' end and one covering the 5' end. The two reads are stored in separate files (and are typically kept separate, at least in the fastq files).
  • L00X: probably stands for "lane 1" and "lane 2". Illumina's flow cells have several lanes into which the DNA is loaded, and the sequencing results from each lane are reported in separate fastq files. If your samples are going to take up more than one lane (there's a limit to how much can be loaded per lane), it is good practice to barcode every sample and distribute all samples across all lanes [1]. This way, any technical effect that an individual lane may have will not be confounded with the distinct samples. It also means, however, that every sample's sequencing results will be split into different fastq files (one for each lane). This is why h.mon suggested simply concatenating the files that belong to the same sample into one fastq file, i.e. $ cat N2_gDNA_001_S1_L00*_R1_001.fastq.gz > N2_gDNA_001_S1_R1.fastq.gz.
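To make the per-lane concatenation concrete, here is a minimal sketch. It builds two tiny, made-up lane files that follow the naming pattern from the question (the read names and sequences are invented purely for illustration), merges them with cat, and counts the reads in the result. It relies on the fact that gzip files concatenated back to back still form a valid gzip stream.

```shell
# Scratch directory so the demo does not touch real data
lane_demo=$(mktemp -d)

# Two made-up single-read FASTQ files, one per lane (illustrative content only)
printf '@lane1_read1\nACGT\n+\nIIII\n' | gzip > "$lane_demo/N2_gDNA_001_S1_L001_R1_001.fastq.gz"
printf '@lane2_read1\nTTGA\n+\nIIII\n' | gzip > "$lane_demo/N2_gDNA_001_S1_L002_R1_001.fastq.gz"

# Merge all lanes of the same sample and read; no need to decompress first
cat "$lane_demo"/N2_gDNA_001_S1_L00*_R1_001.fastq.gz > "$lane_demo/N2_gDNA_001_S1_R1.fastq.gz"

# A FASTQ record is 4 lines, so reads = lines / 4
merged_reads=$(gzip -dc "$lane_demo/N2_gDNA_001_S1_R1.fastq.gz" | awk 'END { print NR / 4 }')
echo "reads in merged file: $merged_reads"
```

Comparing the read count of the merged file with the sum of the per-lane counts is a cheap sanity check before deleting or archiving anything.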

Notice how I assume that SX stands for sample. It could also just mean "sequencing run" or "sunset" or nothing at all. I also have no idea what 00X stands for. These are details you should be able to deduce if you know how many samples you submitted and whether they were sequenced in one or multiple runs.

As for the fastqc* files, I highly recommend actually having a look at them. These are basic quality controls that will tell you whether the sequencing was good. The link to the FastQC tool that Pierre provided above will also lead you to decent documentation, and you may also find pages 13-18 of these course materials helpful.


Illumina's fastq naming convention has changed over the years; once upon a time even the barcode was included in the filename.

Illumina uses underscores to separate the filename fields (which are, in fact, metadata), so using underscores in sample names (as in this case: N2_gDNA_00X) may cause problems with BaseSpace or other Illumina tools. In reality I don't know, as I have never used BaseSpace for data analysis, but at the very least you cannot upload a fastq file to BaseSpace if it doesn't follow certain conventions - see here and here.

The current Illumina convention is that SX is indeed the sample, but the number refers to the sample's position in the sample sheet: the first sample listed in the sample sheet is S1, the second is S2, and so on.

The first 00X in the file names probably means something, but as you pointed out, we really have no way to know what.

Before BaseSpace, the 001 just before .fastq.gz could vary. Files were (well, still are) too big and had to be transferred by ftp. There was (it still exists, but I guess it is rarely used nowadays) a bcl2fastq parameter for splitting each sample into files with a fixed number of sequences, so when a file reached that number of sequences, a new file would be created, incrementing this last segment.
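As a rough illustration of the fields discussed above, the sketch below splits one file name into its parts using plain shell parameter expansion. The example name is a guess at what one of the files in the question looks like once X is filled in, and the variable names are mine. Peeling fields off from the right means that underscores inside the sample name do not confuse the parsing.

```shell
# Hypothetical file name following the pattern from the question
fastq_name="N2_gDNA_001_S1_L002_R1_001.fastq.gz"

base=${fastq_name%.fastq.gz}   # strip the extension
chunk=${base##*_}              # last field: file chunk number
rest=${base%_*}
read_tag=${rest##*_}           # R1 or R2
rest=${rest%_*}
lane=${rest##*_}               # lane, e.g. L002
rest=${rest%_*}
sample_number=${rest##*_}      # position in the sample sheet, e.g. S1
sample_name=${rest%_*}         # everything left, underscores included

echo "sample=$sample_name number=$sample_number lane=$lane read=$read_tag chunk=$chunk"
```

For the example name this separates the sample name N2_gDNA_001 from S1, L002, R1 and 001; whether the trailing 001 inside the sample name carries any meaning is still something only the person who submitted the samples can say.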

8.3 years ago
h.mon 35k

For several applications, digital normalization is more efficient than plain downsampling.

As Pierre Lindenbaum said, only the .fastq.gz are raw reads.

You may concatenate all files that were automatically split up by Illumina's pipeline:

cat N2_gDNA_00X_SX_L00X_R1_*.fastq.gz > N2_gDNA_00X_SX_L00X_R1.fastq.gz
cat N2_gDNA_00X_SX_L00X_R2_*.fastq.gz > N2_gDNA_00X_SX_L00X_R2.fastq.gz

But do not concatenate R1 and R2 together. Depending on your downstream analyses, you may or may not want to concatenate the same samples from different lanes - e.g., for differential gene expression, you may want to keep them separate and check for batch effects.
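For the plain 50% downsampling with seqtk that the question asks about, a minimal sketch could look like the following. The input here is a tiny made-up FASTQ file so that the commands can be tried in isolation; the file names and the seed value 100 are arbitrary choices of mine, not anything prescribed by the tool. The sketch skips the seqtk step if seqtk is not on the PATH.

```shell
# Scratch directory with a tiny made-up FASTQ file (8 reads) standing in for the merged sample
subsample_demo=$(mktemp -d)
for i in 1 2 3 4 5 6 7 8; do
  printf '@read%s\nACGTACGT\n+\nIIIIIIII\n' "$i"
done | gzip > "$subsample_demo/sample_R1.fastq.gz"

if command -v seqtk >/dev/null 2>&1; then
  # -s sets the random seed; a value of 1 or less is treated as the fraction of reads to keep
  seqtk sample -s 100 "$subsample_demo/sample_R1.fastq.gz" 0.5 > "$subsample_demo/sample_R1.sub.fastq"
  # seqtk seq -a converts FASTQ to FASTA, in case the pipeline needs an fa file
  seqtk seq -a "$subsample_demo/sample_R1.sub.fastq" > "$subsample_demo/sample_R1.sub.fa"
  kept_reads=$(grep -c '^>' "$subsample_demo/sample_R1.sub.fa" || true)
  echo "kept $kept_reads of 8 reads"
else
  echo "seqtk not found on PATH; skipping the downsampling step"
fi
```

Since the sampling is random, the kept fraction is only approximately 50%, especially on an input this small. For paired-end data, run the same command on the R1 and R2 files with the same seed so that the pairs stay in sync.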
