I ran two analyses. One started from FASTQ files that I downloaded from Illumina BaseSpace;
the other used FASTQ files generated from the BaseCalls with the bcl2fastq2 program. Interestingly, the FASTQ files differ in size depending on their origin.
Is there any reason for that? I currently work at a diagnostic center, so this is very important for me. Quick answers would be great.
File size is never a good indicator of similarity. Depending on storage architecture the same file may be of different sizes on different storage devices due to differences in sector sizes etc.
Have you looked at the read counts and the total number of bases in the two datasets, assuming they are otherwise identical? Keep in mind that BaseSpace may be trimming your data automatically, whereas standalone bcl2fastq can be set up not to do that by default.
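If it helps, here is a minimal sketch for getting the read and base counts directly from the files instead of relying on file size. It assumes gzipped FASTQ files with standard 4-line records; the function name `fastq_stats` and the file names in the usage comment are made up for illustration.

```python
import gzip


def fastq_stats(path):
    """Return (read_count, base_count) for a gzipped FASTQ with 4-line records."""
    reads = 0
    bases = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line record the sequence is the second line (index 1).
            if line_number % 4 == 1:
                reads += 1
                bases += len(line.rstrip("\r\n"))
    return reads, bases


# Hypothetical usage, one call per source of the same sample:
# print(fastq_stats("basespace/Sample1_R1.fastq.gz"))
# print(fastq_stats("local/Sample1_R1.fastq.gz"))
```

If the two tuples match, the difference is in compression or storage only; if they differ, the conversion settings are the next thing to check.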
Yes, I've checked the number of lines and bases. There is still a difference. For some samples the BaseSpace data is larger; for others the bcl2fastq v2 data is larger. It is not just the file size: the numbers of reads and bases differ as well.
Are these identical samples being processed locally via bcl2fastq and also BaseSpace? Start looking at the scan/trim settings for both methods. While it should not make a difference in theory, are you using the latest bcl2fastq (v.2.20) locally?
What about settings? Are you using "fastq only" for bcl2fastq in your samplesheets? Same setting for BaseSpace? What is the run configuration (cycles x cycles, index)?
If yes, then you are going to have to start digging into the files to see where the differences are.
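One way to start digging is to compare which read identifiers are present in each file. The sketch below assumes gzipped 4-line FASTQ files and treats everything before the first space in the header as the read ID; `read_ids` and `compare_read_ids` are illustrative names, and the whole ID set is held in memory, so run it one sample at a time.

```python
import gzip


def read_ids(path):
    """Collect read identifiers (header text before the first space)."""
    ids = set()
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                # Drop the leading "@" and anything after the first space.
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def compare_read_ids(path_a, path_b):
    """Report how many reads are shared and which appear in only one file."""
    ids_a = read_ids(path_a)
    ids_b = read_ids(path_b)
    return {
        "shared": len(ids_a & ids_b),
        "only_in_a": sorted(ids_a - ids_b),
        "only_in_b": sorted(ids_b - ids_a),
    }
```

Reads that show up on only one side are the ones worth inspecting by hand, for example to see whether they are short, low quality, or flagged in the header.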
For starters, you need to provide the command line used for bcl2fastq, and see if you can find the settings used when BaseSpace made the FASTQs. One obvious thing: while the default compression level in bcl2fastq is 4, it can be set to anything from 1 to 9. This could make the files appear bigger even if they contain the same amount of information. I understand that this does not explain the whole discrepancy in your case. You might also check whether one or the other included reads that did not pass filter. bcl2fastq does not include these by default, but perhaps it was run with the option to include them.
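Both points can be checked from the files themselves. The sketch below is only an illustration: `content_fingerprint` hashes the uncompressed data, so two files written at different compression levels give the same digest if their content is identical, and `count_failed_filter` counts reads flagged "Y" in the filter field, assuming the usual Illumina header layout (`<read>:<is filtered>:<control>:<index>` after the space). Both function names are made up.

```python
import gzip
import hashlib
import os


def content_fingerprint(path):
    """Return (compressed_bytes, uncompressed_bytes, sha256 of uncompressed data)."""
    digest = hashlib.sha256()
    uncompressed = 0
    with gzip.open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            uncompressed += len(chunk)
            digest.update(chunk)
    return os.path.getsize(path), uncompressed, digest.hexdigest()


def count_failed_filter(path):
    """Count reads whose header marks them as filtered (flag 'Y')."""
    failed = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                fields = line.split()
                # Assumed comment field: <read>:<is filtered>:<control>:<index>
                if len(fields) > 1:
                    parts = fields[1].split(":")
                    if len(parts) > 1 and parts[1] == "Y":
                        failed += 1
    return failed
```

A nonzero count from `count_failed_filter` on one side only would suggest that one conversion kept reads that failed the filter.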
huseyin@tani-merkezi:~$ bcl2fastq --version
BCL to FASTQ file converter
bcl2fastq v2.20.0.422
Copyright (c) 2007-2017 Illumina, Inc.
Yes, the version is the latest.