News: Free HiSeq X Ten human genome FASTQ test data
9.7 years ago
rbagnall ★ 1.8k

Download Illumina HiSeq X Ten test data from the Garvan Institute, Australia, at the AllSeq website.

FASTQ files are available for NA12878D and NA12878J. BAM files, FastQC reports, and Picard MarkDuplicates metrics files are available too.

The files are available without registering until September 30, 2014.

http://allseq.com/x-ten-test-data

HiSeq-X-Ten Human-genome Fastq next-gen • 5.8k views
9.7 years ago

I am looking at the plot of kmer content. It looks a bit ... crazy ... as if most of the data were made of just a few patterns.

https://dnanexus-rnd.s3.amazonaws.com/NA12878-xten/fastqc-statistics/NA12878D_HiSeqX_R1.stats-fastqc.html#M9

Fastqc quality plot


That is weird. Maybe those are the only over-represented sequences. Even CCCCC appears at only 15x the expected rate.

Could also be something about fastqc's sampling (I assume they don't count everything).
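To make the "15x expected rate" figure concrete, here is a minimal sketch of how an observed/expected ratio for a k-mer could be computed. It assumes the expectation comes from the overall base composition of the reads with bases treated as independent; that is an illustrative assumption, not necessarily the exact calculation FastQC performs. The function name `kmer_obs_exp` and the toy reads are made up for the example.

```python
from collections import Counter


def kmer_obs_exp(reads, kmer):
    """Observed/expected ratio for one k-mer.

    The expected count assumes independent bases drawn from the overall
    base composition of the reads (an illustrative assumption).
    """
    k = len(kmer)
    base_counts = Counter()
    observed = 0
    windows = 0  # number of k-length windows across all reads
    for read in reads:
        base_counts.update(read)
        for i in range(len(read) - k + 1):
            windows += 1
            if read[i:i + k] == kmer:
                observed += 1

    total_bases = sum(base_counts.values())
    # Probability of seeing this k-mer in one window under independence.
    p = 1.0
    for base in kmer:
        p *= base_counts[base] / total_bases

    expected = p * windows
    return observed / expected if expected else float("nan")


# Toy reads (hypothetical); one poly-C read inflates the CCCCC ratio.
reads = ["ACGTACGTAC", "CCCCCCCCCC", "GATTACAGAT", "TTGACCGTAA"]
ratio = kmer_obs_exp(reads, "CCCCC")
print(f"CCCCC observed/expected: {ratio:.2f}")
```

A ratio well above 1 means the k-mer shows up more often than base composition alone would predict.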


Something is definitely wrong there. Their FastQC report indicates useless data.

I got intrigued and generated my own FastQC report. As it turns out, mine is quite different from theirs, beyond just using a newer version of FastQC.

http://apollo.huck.psu.edu/data/NA12878D_HiSeqX_R1_fastqc.html

In my report the kmer content does actually make sense.

But then of course I can't help but wonder: if we can't even get the same FastQC report out of the data, how are we going to reconcile more complicated information?


Their fastqc report is on the bam file NA12878D_HiSeqX_R1.bam, rather than the fastq file.


If both mapped and unmapped reads are included then using a BAM file should not make any difference.

What will make a difference (apparently a huge one in this case) is something I have only just realized. When someone runs FastQC on a sorted BAM file, the results may end up biased towards the properties of the data that map at the start of the genome, whatever those may be. The Kmer and sequence duplication modules only use the first 200,000 reads or 2% of the data. Normally raw data is not ordered in any predictable way relative to the genome.

Also, the BAM file contains both reads, not just read 1. I'll run the report for read 2 by tomorrow.
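The sampling bias described above can be illustrated with a small simulation. This is only a sketch under loud assumptions: a toy 100 kb "genome" whose first half is AT-rich and second half GC-rich, error-free reads, and a statistic (GC fraction) computed from only the first `sample` reads. None of the names or numbers come from FastQC itself.

```python
import random


def gc_fraction(reads):
    """Fraction of G/C bases across a list of reads."""
    bases = "".join(reads)
    return (bases.count("G") + bases.count("C")) / len(bases)


def simulate(n_reads=20000, read_len=50, sample=2000, seed=0):
    rng = random.Random(seed)
    half = 50000
    # Toy genome (assumption): AT-rich first half, GC-rich second half.
    at_rich = "".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=half))
    gc_rich = "".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=half))
    genome = at_rich + gc_rich

    # Reads come off the sequencer in no particular genomic order.
    starts = [rng.randrange(len(genome) - read_len) for _ in range(n_reads)]
    unsorted_reads = [genome[s:s + read_len] for s in starts]
    # A coordinate-sorted BAM lists the same reads by mapping position.
    sorted_reads = [genome[s:s + read_len] for s in sorted(starts)]

    return {
        "all": gc_fraction(unsorted_reads),
        "first_unsorted": gc_fraction(unsorted_reads[:sample]),
        "first_sorted": gc_fraction(sorted_reads[:sample]),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the first reads of the unsorted file track the whole data set, while the first reads of the sorted file only reflect the start of the genome.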


I have now rerun the FastQC reports on each read file as well as on the BAM file. My plots do not match the reports they have produced.


Weird stuff going on around 50 bases.
