Parallel FastQC computation
1
1
9.5 years ago
dolevrahat ▴ 30

Hello

I want to parallelize a FastQC computation on a large FASTQ file.

I know that FastQC can process several files in parallel using the -t option, but I'm not sure whether it would yield a correct result if I split the file into several smaller files.

Is there any way to do this?

Thanks in advance
Dolev Rahat

fastqc • 7.2k views
0

Hi,

I would split the FASTQ file into two files and run FastQC with the -t option. It should work fine, though I haven't tested it.

0

To my knowledge, FastQC operates on a subsample of your input file for its analyses; there is no need to process all reads for summary statistics.

3
9.5 years ago

If you split the file into small parts, FastQC will run on each part separately, and the results will not be combined.
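To illustrate what "not combined" means in practice, here is a minimal sketch of merging one simple statistic (mean quality per read position) across chunks by hand. This is not something FastQC does, and it does not read FastQC output; the function names are made up for the example, and Phred+33 encoding is assumed. The point is that totals must be summed across chunks before dividing, otherwise a small chunk counts as much as a large one.

```python
from collections import Counter

def quality_sums(quality_lines, offset=33):
    """Return per-position (sum, count) Phred totals for one chunk."""
    sums, counts = Counter(), Counter()
    for line in quality_lines:
        for pos, char in enumerate(line):
            sums[pos] += ord(char) - offset  # assumes Phred+33 encoding
            counts[pos] += 1
    return sums, counts

def merge_mean_quality(chunks):
    """Add up chunk totals first, then divide once per position."""
    total_sums, total_counts = Counter(), Counter()
    for sums, counts in chunks:
        total_sums.update(sums)      # Counter.update adds values
        total_counts.update(counts)
    return [total_sums[p] / total_counts[p] for p in sorted(total_counts)]

# Toy chunks, invented for illustration.
chunk_a = quality_sums(["II", "II", "II"])  # three reads at Phred 40
chunk_b = quality_sums(["55"])              # one read at Phred 20
merged = merge_mean_quality([chunk_a, chunk_b])
print(merged)
```

Averaging the two chunk means directly would give 30 per position, while the merged value reflects that three of the four reads are at Phred 40. Statistics that depend on seeing the same sequence in different chunks, such as duplication, cannot be merged this simply.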

As for an answer to your question: the problem (as always) is more complicated than it looks. In a perfect world, splitting the file into random subsets would produce similar results. But splitting into random subsets is not all that easy. In addition, the way FastQC works is that some operations only collect information from the first 200K or so reads but then track that information throughout the file.

Often, if the data is not skewed in particular ways, you can get by with analyzing a small subset of the file.

In other cases (as I have painfully learned myself) subsetting a dataset can produce reports that do not even remotely resemble the original data. But that had nothing to do with the way FastQC works.

Long story short, I would run FastQC on all the data, then on a subset of it, and check whether the library prep and the subject under study support subsetting.
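For the subset, one common way to draw a uniformly random sample in a single pass, without loading the whole file, is reservoir sampling. The sketch below is only an illustration: `read_fastq` and `reservoir_sample` are names invented here, and it assumes plain four-line FASTQ records with no wrapped sequence lines.

```python
import io
import random
from itertools import islice

def read_fastq(handle):
    """Yield 4-line FASTQ records as tuples (assumes unwrapped records)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record

def reservoir_sample(records, k, seed=0):
    """Keep k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

# Toy in-memory FASTQ with 100 reads, invented for illustration.
fastq = io.StringIO("".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(100)))
subset = reservoir_sample(read_fastq(fastq), k=10)
print(len(subset))
```

With a real file you would pass an open file handle instead of the `StringIO` object and write the sampled records to a new FASTQ file before running FastQC on it. Note that only k records are held in memory, but the whole input still has to be read once.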

0

Thank you for your insightful answer. The thing is that once I have run FastQC on the entire dataset, I have no reason to run it on a subset of that dataset, because I already know the characteristics of the dataset. Or am I missing something?

0

The idea is that if you have many similar samples, only one would need to be fully processed.

0

In my experience, the first reads in a file are often of really bad quality. Basically, one should never assume that a "first N reads" subset matches a "random N reads" subset in parameters such as quality.
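A small sketch of that difference, using synthetic quality strings invented for the example (100 poor reads followed by 900 good ones, Phred+33 assumed); `mean_phred` is a made-up helper, not part of any tool:

```python
import random

def mean_phred(quality_strings, offset=33):
    """Mean Phred score over all bases in the given quality strings."""
    scores = [ord(c) - offset for q in quality_strings for c in q]
    return sum(scores) / len(scores)

# Synthetic data: the first 100 reads are poor, the rest are good.
qualities = ["#" * 50] * 100 + ["I" * 50] * 900

n = 100
first_n = qualities[:n]                           # head of the file
random_n = random.Random(0).sample(qualities, n)  # uniform random subset

print("first N :", mean_phred(first_n))
print("random N:", mean_phred(random_n))
print("all     :", mean_phred(qualities))
```

On data like this the "first N" subset sees only the poor reads, while the random subset lands close to the mean of the whole file. Real data will be less extreme, but the same comparison can be made on it.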

0

I agree, it is best to subsample, though most samplers keep the reads in their original order and simply skip some entries. Properly shuffling the file, however, would take longer than running FastQC on it.
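A rough sketch of that kind of order-preserving sampler, assuming each record is kept independently with a fixed probability (`skip_sample` is a hypothetical name, and real tools may implement this differently):

```python
import random

def skip_sample(records, fraction, seed=0):
    """Yield each record with probability `fraction`, preserving input order."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record

# Integers stand in for FASTQ records here, to make the ordering visible.
kept = list(skip_sample(range(1000), fraction=0.1))
print(len(kept), kept[:5])
```

This runs in one streaming pass with no shuffle, which is why it is cheap. The trade-off is that the number of records kept is only approximately `fraction` times the input size, and the output is still in file order.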
