14 months ago
datanerd
Hello,
Can I convert WGS FASTQ data of known coverage to gigabytes? How do we convert a known number of bases in a genomic file to a file size metric (gigabytes)?
This question is difficult to answer. Because sequences differ, the data may be compressible to different extents, so the file size could vary with content even if two files contain the same number of bases.
Assuming we're going for just a FASTQ and not FASTQ.GZ, would the size still depend on the QUAL scores, or would all QUAL scores take up the same storage? Let's also assume that the third line is just +.
That said, simply take any of your FASTQ files at hand, count the number of characters, and check the file size. Or simulate some reads (with wgsim, for example), check the size, and extrapolate. This is not precise if the file is gzipped, because sequence composition determines compressibility, but you get an idea.
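The "count and compare" suggestion can be tried directly. Here is a minimal Python sketch, assuming an uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines). The function name `bytes_per_base` is made up for illustration.

```python
import os


def bytes_per_base(fastq_path):
    """Return file size in bytes divided by the number of bases.

    Assumes a plain-text FASTQ with four lines per record.
    """
    total_bases = 0
    with open(fastq_path, "rb") as handle:
        for line_number, line in enumerate(handle):
            # Line index 1 of every 4-line record is the sequence line.
            if line_number % 4 == 1:
                total_bases += len(line.rstrip(b"\r\n"))
    if total_bases == 0:
        raise ValueError("no bases found in " + fastq_path)
    return os.path.getsize(fastq_path) / total_bases
```

Multiplying the returned ratio by the number of bases you expect gives a rough size estimate in bytes, as long as read length and header style match the file you measured.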
Again, compression is not part of the equation here. The question I'm addressing is whether the plain-text FASTQ gigabyte<->gigabase conversion follows a calculable equation.
It would depend on the encoding being used: ASCII (1 byte, or 8 bits, per character) or one of the Unicode encodings (variable width).
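For the single-byte case, the arithmetic can be written out. This is a rough sketch, not a definitive formula. It assumes one byte per character, a third line that is just +, one-byte (LF) line endings, and fixed read and header lengths. The function names and the example numbers are hypothetical. Under these assumptions every quality character takes one byte whatever its value, so each record costs header + 2 × read length + 1 + 4 newlines.

```python
def fastq_bytes(n_reads, read_length, header_length, newline_bytes=1):
    """Estimate the size in bytes of an uncompressed FASTQ.

    Per record: header line, sequence line, a lone '+', and a quality
    line as long as the sequence, each followed by a newline.
    header_length counts the leading '@'.
    """
    per_record = header_length + 2 * read_length + 1 + 4 * newline_bytes
    return n_reads * per_record


def fastq_gigabytes(total_bases, read_length, header_length):
    """Convert a base count to gigabytes (10**9 bytes), same assumptions."""
    n_reads = total_bases // read_length  # ignores a partial final read
    return fastq_bytes(n_reads, read_length, header_length) / 1e9


# Hypothetical example: 93 gigabases in 150 bp reads with 40-character headers.
print(fastq_gigabytes(93_000_000_000, 150, 40))
```

The header term is why the answer is a little more than 2 bytes per base: with short reads or long read names, the per-record overhead becomes a larger share of the file.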