Biostar Beta. Not for public use.
Why do R1 and R2 compressed files have different sizes?
11 months ago
MAPK ♦ 1.4k
United States

I have transcriptome data as paired files, R1.fastq and R2.fastq, of 10.8 GB each. I compressed them with gzip R1.fastq and gzip R2.fastq, and the resulting files are 2.2 GB and 2.4 GB. Can two compressed files differ in size when the uncompressed files are the same size?

fastq gzip • 622 views

File sizes should never be used as a quantitative measure of anything. Count the number of reads in both files if you want to be certain.
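As a rough sketch of that check, the Python below counts records in each file of a pair and compares the totals. It assumes the standard four-lines-per-record FASTQ layout; the function names and the example file names are illustrative, not taken from any particular tool.

```python
import gzip


def count_fastq_records(path):
    """Count FASTQ records in a plain or gzip-compressed file.

    Assumes the usual layout of exactly 4 lines per record.
    """
    opener = gzip.open if path.endswith(".gz") else open
    n_lines = 0
    with opener(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def pairs_match(r1_path, r2_path):
    """Return True when both files hold the same number of records."""
    return count_fastq_records(r1_path) == count_fastq_records(r2_path)


# Example call (hypothetical file names):
# pairs_match("R1.fastq.gz", "R2.fastq.gz")
```

Counting records instead of comparing byte sizes also catches a truncated file, since a partial record makes the line count stop being a multiple of four.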


Thanks! I was submitting these pairs to NCBI SRA and wanted to make sure this won't cause any problems.


As you know, I had this problem last time with the SRA file, where the two files were asymmetric. I just wanted to submit the compressed files this time. Yes, wc -l reports the same number of lines for both files.


Upload from a fast wired connection to reduce the chance of corruption or interruption during the upload.
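A minimal sketch of two local sanity checks that can be run before and after a transfer: an MD5 checksum to compare copies, and a full read-through of the gzip stream to detect truncation. Whether the receiving side reports a checksum to compare against is an assumption here, and the function names are illustrative.

```python
import gzip
import hashlib
import zlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupted streams raise one of these while reading.
        return False
    return True
```

If the digest of the local file equals the digest of the copy at the destination, the two are byte-identical for practical purposes.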

15 months ago
swbarnes2 5.7k
United States

Yes. It's perfectly possible, even if the reads are the same length. One might have sequences that are a little more repetitive, and therefore more compressible. If they have the same number of lines, that's all that matters.

It is of course also possible to run gzip with different levels of compression, but you don't seem to have done that in this case.
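To illustrate both points, here is a small sketch using synthetic data (not real reads): two inputs of identical length over the same four-letter alphabet are compressed at several levels. The inputs and the helper name are made up for the demonstration.

```python
import gzip
import random


def compressed_size(data, level):
    """Return the size in bytes of data after gzip compression at the given level."""
    return len(gzip.compress(data, compresslevel=level))


random.seed(0)
n = 200_000

# Same length, same alphabet, very different amounts of repetition.
repetitive = ("ACGT" * (n // 4)).encode()
varied = "".join(random.choice("ACGT") for _ in range(n)).encode()

for name, data in (("repetitive", repetitive), ("varied", varied)):
    for level in (1, 6, 9):
        print(name, "level", level, "raw", len(data), "gz", compressed_size(data, level))
```

The repetitive input shrinks far more than the varied one even though both start at the same size. The command-line gzip uses level 6 unless told otherwise, so two files compressed with a plain `gzip` call share the same level.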


"One might have sequences that are a little more repetitive"

Mmm... The difference the OP observes is quite noticeable. If the sequence is the cause, it may indicate a problem, since read 1 and read 2 should both be fairly random with respect to genomic position. See my answer below for an alternative explanation. (Unless by "sequence" you also mean the quality string, in which case my answer is similar to yours.)

13 months ago
WCIP | Glasgow | UK

A wild guess... Second-in-pair reads usually have base qualities that drop faster along the read than first-in-pair reads. This makes the quality line of each FASTQ record more variable (i.e. more random and less compressible) in R2 than in R1.
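One way to test this guess on real files is to compress each kind of FASTQ line as its own stream and compare the totals between R1 and R2. The sketch below assumes four lines per record; the function name is hypothetical. Compressing the streams separately is not the same as what gzip does on the interleaved file, so the numbers only give a rough indication of where the extra bytes come from.

```python
import gzip
import zlib

LINE_TYPES = ("header", "sequence", "separator", "quality")


def compressed_size_by_line_type(path, level=6):
    """Compress each FASTQ line type as a separate stream and return the sizes.

    Returns a dict mapping line type to compressed size in bytes.
    Assumes exactly 4 lines per record.
    """
    opener = gzip.open if path.endswith(".gz") else open
    compressors = {name: zlib.compressobj(level) for name in LINE_TYPES}
    sizes = dict.fromkeys(LINE_TYPES, 0)
    with opener(path, "rb") as handle:
        for i, line in enumerate(handle):
            name = LINE_TYPES[i % 4]
            sizes[name] += len(compressors[name].compress(line))
    for name in LINE_TYPES:
        # Flush whatever each compressor still holds in its buffer.
        sizes[name] += len(compressors[name].flush())
    return sizes


# Example call (hypothetical file names):
# for path in ("R1.fastq.gz", "R2.fastq.gz"):
#     print(path, compressed_size_by_line_type(path))
```

If the "quality" entry is clearly larger for R2 while the "sequence" entries are close, that would be consistent with the quality strings driving the size difference.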
