1 base pair to 'x' byte conversion
4.3 years ago
kanika.151

Hello all,

Does anyone know the base pair to byte conversion? I was recently asked: if each sample has 'y' million reads, how much space would it occupy on our cluster?

How would you answer it?


Depending on the actual sequence, files will compress more or less (similar sequences next to each other compress better), so there is no way to make a size estimate beforehand. You could generate totally random fake fastq data and see what size the file occupies. That is roughly the largest size you would need to account for with that particular data type (number of reads, read length).
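A minimal sketch of that idea in Python (file name, read count, and read length are placeholder assumptions to adjust to your own data); it writes random reads both uncompressed and gzip-compressed, since random sequence compresses worst and so gives an upper bound:

    # write random fastq records and report uncompressed and gzipped sizes
    import gzip
    import os
    import random

    def write_dummy_fastq(path, n_reads, read_len):
        bases = "ACGT"
        with open(path, "w") as plain, gzip.open(path + ".gz", "wt") as gz:
            for i in range(n_reads):
                seq = "".join(random.choice(bases) for _ in range(read_len))
                qual = "I" * read_len  # dummy quality string, same length as the read
                record = f"@read_{i}\n{seq}\n+\n{qual}\n"
                plain.write(record)
                gz.write(record)
        return os.path.getsize(path), os.path.getsize(path + ".gz")

    plain_size, gz_size = write_dummy_fastq("dummy.fastq", n_reads=100_000, read_len=100)
    print(plain_size, gz_size)  # scale linearly to your actual read count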


Hello All,

What do you folks think of this article? https://bitesizebio.com/8378/how-much-information-is-stored-in-the-human-genome/

Do you think the conversion from this article can be used to give an estimate?

6×10^9 base pairs/diploid genome x 1 byte/4 base pairs = 1.5×10^9 bytes or 1.5 Gigabytes, about 2 CDs worth of space! Or small enough to fit 3 separate genomes on a standard DVD!


It is not directly applicable. That calculation estimates the byte size of a diploid genome, whereas you want to estimate the size of a text file, which also depends on read length and the length of the header lines. The principle is the same, though: 1 byte per character, as outlined above. Just take any random fastq file with the same read length, subsample it to 1 million reads, and then multiply according to your read numbers, as in the sketch below.
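A minimal sketch of that subsample-and-scale idea, assuming an uncompressed fastq on disk (the file name and read counts are placeholders):

    # average the byte size of the first SAMPLE_READS reads, then scale up
    SAMPLE_READS = 1_000_000
    TOTAL_READS = 20_000_000  # e.g. 'y' million reads per sample

    sampled_bytes = 0
    lines_read = 0
    with open("sample.fastq", "rb") as fh:
        for line in fh:
            if lines_read >= SAMPLE_READS * 4:  # a fastq record spans 4 lines
                break
            sampled_bytes += len(line)  # binary mode, so newlines are counted
            lines_read += 1

    reads_seen = lines_read // 4
    estimate = sampled_bytes / reads_seen * TOTAL_READS
    print(round(estimate / 1e9, 2), "GB uncompressed for", TOTAL_READS, "reads")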


Thank you. I had taken read length into consideration as the data is paired-end. I will do as you all have suggested. Thanks again! :)

4.3 years ago
ATpoint

I do not think this can be answered in general, as fastq is pretty much always gzip-compressed, and compressed file size depends on nucleotide composition. For uncompressed files you can approximate it with 1 byte per character (remember that each line ends with a hidden newline, so +1 for that). Given equal read lengths, that would probably be something like the following for each read (each read spans 4 lines):

  (number of characters per read header line = line1)    + 1
+ (number of characters per read sequence    = line2)    + 1
+ (1 for the + in line 3)                                + 1
+ (number of characters per read quality line   = line4) + 1
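
A minimal sketch of that arithmetic in Python (read count, read length, and header length are placeholder numbers to replace with your own):

    # rough uncompressed fastq size: 1 byte per character plus 1 byte per newline
    def fastq_size_bytes(n_reads, read_len, header_len):
        per_read = (header_len + 1) + (read_len + 1) + (1 + 1) + (read_len + 1)
        return n_reads * per_read

    # e.g. 20 million reads of 100 bp with ~40-character header lines
    print(fastq_size_bytes(20_000_000, 100, 40) / 1e9, "GB")  # about 4.9 GB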

Or you simply make a dummy fastq file with the same read length as your sample and a certain number of reads, get the file size with ls -l, and then multiply to match your actual number of reads.

Just to give you an idea, I checked a random fastq from a ChIP-seq experiment I had around: 17.5 million reads, 50 bp read length, read headers around 16 characters long:

$ ls -lh foo.fastq* && ls -l foo.fastq*

-rw-r--r-- 1 xx xx 2.3G Jan  8 10:11 foo.fastq
-rw-r--r-- 1 xx xx 460M Jan  7 18:18 foo.fastq.gz
-rw-r--r-- 1 xx xx 2369334530 Jan  8 10:11 foo.fastq
-rw-r--r-- 1 xx xx  482130248 Jan  7 18:18 foo.fastq.gz
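
Plugging those numbers into the per-read arithmetic above gives roughly the same figure (the remaining gap would come from the real header lines being somewhat longer than 16 characters):

    (16 + 1) + (50 + 1) + (1 + 1) + (50 + 1) = 121 bytes per read
    121 × 17,500,000 ≈ 2.1 GB, close to the observed 2369334530 bytes (~2.3 GB) uncompressed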

As ATpoint said, the best you could do is put an upper limit on the space required for the uncompressed data, and then rest somewhat easy knowing that your compressed data will be smaller than that, but there's no real way to know ahead of time by how much.
