Convert genomic files from known gigabases (# of bases) to gigabytes

14 months ago

datanerd ▴ 520

Hello,

I can convert WGS FASTQ data of known coverage to gigabytes. How do we convert a known number of bases in a genomic file to a file size (in gigabytes)?

gigabases gigabytes genomics • 685 views

updated 14 months ago by GenoMax 142k • written 14 months ago by datanerd ▴ 520

This question is difficult to answer. Because sequence content differs between files, the data may compress to different extents, so file size can vary with content even when two files contain the same number of bases.

14 months ago by GenoMax 142k

Assuming we're going for just a FASTQ and not FASTQ.GZ, would the size still depend on QUAL scores or would all QUAL scores take up the same storage? Let's also assume that the third line is just +.

14 months ago by Ram 43k

That said, simply take any FASTQ file you have at hand, count the number of characters, and check the file size. Or simulate some reads (with wgsim, for example), check the size, and extrapolate. This is not precise if the file is gzipped, since sequence composition determines compressibility, but it gives you an idea.
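One way to do this measurement is sketched below in Python. It is only an illustration, assuming an uncompressed FASTQ with strict 4-line records (no wrapped sequence lines); the function names `count_bytes_and_bases` and `extrapolate_gigabytes` are made up for this example.

```python
def count_bytes_and_bases(fastq_handle):
    """Return (total_bytes, total_bases) for an uncompressed FASTQ
    stream opened in binary mode. Assumes 4 lines per record."""
    total_bytes = 0
    total_bases = 0
    for i, line in enumerate(fastq_handle):
        total_bytes += len(line)  # includes the newline byte(s)
        if i % 4 == 1:  # second line of each record is the sequence
            total_bases += len(line.rstrip(b"\r\n"))
    return total_bytes, total_bases


def extrapolate_gigabytes(total_bytes, total_bases, target_gigabases):
    """Scale the observed bytes-per-base ratio to a target yield.
    Gigabases and gigabytes are both taken as 1e9 units here."""
    bytes_per_base = total_bytes / total_bases
    return bytes_per_base * target_gigabases
```

For a real file, something like `with open("sample.fastq", "rb") as fh: count_bytes_and_bases(fh)` would give the two totals (the file name is a placeholder). The ratio only transfers to data with similar read length and header format.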

14 months ago by ATpoint 82k

Again, compression is not part of the equation here. The question I'm addressing is whether the plain-text FASTQ gigabyte<->gigabase conversion follows a calculable equation.

14 months ago by Ram 43k

It would depend on the encoding being used: ASCII (1 byte, or 8 bits, per character) or one of the Unicode encodings (variable width).

Total size (in Bytes) = ((Number of bits used to encode a single character) * (Number of characters))/8
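As a rough sketch of how this formula could be applied to plain-text FASTQ, the Python below counts the characters in one 4-line record and multiplies by the number of reads. It assumes every read has the same length, every header has the same length (including the leading `@`), the third line is just `+`, and lines end with a single newline character. With one ASCII character per quality score, the size does not depend on the quality values themselves. The function name `fastq_size_bytes` and the header length of 40 in the usage example are hypothetical.

```python
def fastq_size_bytes(n_reads, read_length, header_length, bits_per_char=8):
    """Estimate the size of an uncompressed FASTQ file in bytes.

    Characters per record, each line followed by one newline:
      header line   : header_length + 1
      sequence line : read_length + 1
      '+' line      : 1 + 1
      quality line  : read_length + 1
    """
    chars_per_record = (header_length + 1) + (read_length + 1) + 2 + (read_length + 1)
    return n_reads * chars_per_record * bits_per_char // 8


# Usage: 90 gigabases as 150 bp reads with an assumed 40-character header
n_reads = 90_000_000_000 // 150
print(fastq_size_bytes(n_reads, 150, 40) / 1e9, "GB")
```

Under these assumptions the file takes a little over 2 bytes per base, and the overhead per base grows as reads get shorter or headers get longer.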

14 months ago by GenoMax 142k