Convert genomic files from a known number of gigabases (# of bases) to gigabytes
14 months ago
datanerd ▴ 520

Hello,

I can convert WGS FASTQ data of known coverage to gigabytes. How do we convert a known number of bases in a genomic file to a file size (in gigabytes)?

gigabases gigabytes genomics • 685 views

This question is difficult to answer. Because sequence content differs between files, the data may compress to different extents, so file size can vary with content even when two files contain the same number of bases.


Assuming we're going for just a FASTQ and not FASTQ.GZ, would the size still depend on QUAL scores or would all QUAL scores take up the same storage? Let's also assume that the third line is just +.


That said, simply take any FASTQ file you have at hand, count the number of characters, and check the file size. Or simulate some reads (with wgsim, for example), check the size, and extrapolate. This is not precise for gzipped files, since sequence composition determines compressibility, but it gives you an idea.
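
A rough sketch of that measure-and-extrapolate idea in Python, in case it helps. It assumes an uncompressed FASTQ with strict four-line records (no wrapped sequence lines); the function names `count_bytes_and_bases` and `estimate_gigabytes` are made up for illustration.

```python
import io


def count_bytes_and_bases(fastq_handle):
    """Return (total_bytes, total_bases) for an uncompressed FASTQ.

    The handle must be opened in binary mode so that len(line) is a byte count.
    Assumes every record is exactly four lines.
    """
    total_bytes = 0
    total_bases = 0
    for i, line in enumerate(fastq_handle):
        total_bytes += len(line)          # includes the newline byte(s)
        if i % 4 == 1:                    # second line of each record = sequence
            total_bases += len(line.rstrip(b"\r\n"))
    return total_bytes, total_bases


def estimate_gigabytes(sample_bytes, sample_bases, target_gigabases):
    """Scale the sample's bytes-per-base ratio up to a target number of gigabases.

    Gigabases and gigabytes are both taken as 10**9 units, so the ratio carries over.
    """
    return target_gigabases * (sample_bytes / sample_bases)


# Tiny in-memory example; for a real file use open("reads.fastq", "rb").
sample = io.BytesIO(b"@r1\nACGT\n+\nIIII\n")
n_bytes, n_bases = count_bytes_and_bases(sample)
print(n_bytes, n_bases)                          # 16 4
print(estimate_gigabytes(n_bytes, n_bases, 90))  # 360.0
```

The toy record has a very short header and read, so its ratio of 4 bytes per base is much higher than what you would see with realistic read lengths; measure on your own data before extrapolating.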


Again, compression is not part of the equation here. The question I'm addressing is whether plain-text FASTQ gigabyte<->gigabase conversion follows a calculable equation.


It would depend on the encoding being used: ASCII (1 byte, i.e. 8 bits, per character) or one of the Unicode encodings (variable width).

Total size (in Bytes) = ((Number of bits used to encode a single character) * (Number of characters))/8
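
Applied to plain-text FASTQ, that formula can be sketched as below. This is only an estimate under stated assumptions: ASCII at one byte per character, fixed read length, a fixed header length (the 40 characters used in the example is a made-up value; real headers vary by instrument and pipeline), a bare `+` on the third line, and one newline byte per line. Under ASCII each quality score is a single character, so the QUAL line takes the same space as the sequence line whatever the scores are. The name `fastq_bytes` is hypothetical.

```python
def fastq_bytes(total_bases, read_length, header_length, newline_bytes=1):
    """Estimate the size in bytes of an uncompressed, ASCII-encoded FASTQ.

    Per record: header + sequence + '+' + qualities, plus four line endings.
    header_length counts the leading '@'. All reads are assumed to have
    the same length; a partial final read is ignored.
    """
    n_reads = total_bases // read_length
    per_record = (
        header_length        # line 1: @read_id ...
        + read_length        # line 2: bases, 1 byte each
        + 1                  # line 3: bare '+'
        + read_length        # line 4: one quality character per base
        + 4 * newline_bytes  # line endings (use 2 for CRLF files)
    )
    return n_reads * per_record


# 90 gigabases of 150 bp reads with an assumed 40-character header.
size = fastq_bytes(90_000_000_000, read_length=150, header_length=40)
print(size / 1e9)  # 207.0 (gigabytes, 10**9 bytes)
```

So the ratio is a little over 2 bytes per base (one for the base, one for its quality), with the header and line-ending overhead shrinking as reads get longer.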