I am doing research that involves processing reads from a BAM file. I need to predict the number of reads when my program starts, but without reading the whole file, because that takes too long for large input files.
My idea is to read the first few reads to estimate the average size of a read and then, knowing the total size of the file, calculate the approximate number of reads in it. The count does not have to be exact; an approximation is good enough for me.
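The estimate described above can be sketched roughly as follows. This is only a minimal illustration, not a tested tool: the function name `estimate_read_count` and the parameters `total_bytes`, `sample_sizes`, and `header_bytes` are hypothetical names chosen for the example. It assumes the sampled read sizes are measured in the same units as the file size. Since BAM is BGZF-compressed, that means compressed bytes on disk per read, not the uncompressed record length.

```python
from statistics import mean


def estimate_read_count(total_bytes, sample_sizes, header_bytes=0):
    """Estimate how many reads a file holds from a small sample.

    total_bytes  -- size of the file on disk, e.g. from os.path.getsize(path)
    sample_sizes -- on-disk sizes (in bytes) of the first few reads
    header_bytes -- bytes taken by the header, if known (assumed 0 otherwise)
    """
    if not sample_sizes:
        raise ValueError("need at least one sampled read size")
    avg_read_bytes = mean(sample_sizes)
    # Bytes available for reads divided by the average bytes per read.
    return round((total_bytes - header_bytes) / avg_read_bytes)


# Illustrative numbers only: a 1,000,000-byte file and three sampled reads.
print(estimate_read_count(1_000_000, [95, 100, 105]))  # prints 10000
```

The estimate is only as good as the sample: if the first reads are not representative of the rest of the file, the count will be off by the same proportion as the average size.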
So my question is: how much does the size of individual reads vary within a file? In my current data, for example, reads are between 95 and 105 bytes, which is fine for me. But I'm not sure whether this holds for other files.