How much storage is needed for whole-genome sequencing?
4.9 years ago
star ▴ 350

I am about to start a whole-genome sequencing study and would like an estimate of the storage needed for the analysis. I know it depends on several things, such as coverage and whether the reads are single-end or paired-end, but any estimate would be a great help!

I plan to run paired-end WGS on 100 individuals at 50X coverage.

Whole-genome-sequencing sequencing • 4.1k views

It may be helpful to look through your own datasets to see what you have. Extrapolate from those numbers and add 10% to account for variation.
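As a minimal sketch of that extrapolation (the function name and the example sizes are hypothetical, not taken from any real dataset), one could average the per-sample footprint of existing data, scale it to the planned cohort, and add the 10% margin:

```python
def extrapolate_gb(observed_gb, n_planned, margin=0.10):
    """Scale the mean per-sample size of existing data to a planned cohort.

    observed_gb: per-sample storage footprints (GB) measured on your own data
    n_planned:   number of samples in the planned study
    margin:      safety margin for sample-to-sample variation (10% by default)
    """
    mean_gb = sum(observed_gb) / len(observed_gb)
    return mean_gb * n_planned * (1 + margin)


# Hypothetical example: two existing samples of 90 GB and 110 GB, 100 planned.
estimate = extrapolate_gb([90, 110], n_planned=100)
print(f"{estimate / 1000:.1f} TB")
```

The measured sizes should cover everything you intend to keep per sample (raw reads, alignments, variant calls), otherwise the estimate will be too low.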


Mentioning which genome/species you're working on might be the crucial bit of info here.

For viral genomes a few GB will do; for conifer genomes, for instance, think in the range of several TB.


I am working on the Human Genome.

4.9 years ago
ATpoint 81k

If memory serves, in a cohort of 20 matched-normal WGS samples I analyzed in the past (human, roughly 30-50x, 2x100bp), one BAM file at compression level 5 was roughly 100 GB. Roughly double that to account for the raw data, be it FASTQ or unaligned BAM, and you are at about 200 GB per individual; times 100 individuals, that is roughly 20 TB. Definitely manageable on a decent HPC with a parallel file system, given you have access to such a cluster. Make sure you have the space, the CPU and memory resources, and most importantly that your scripts are well tested before starting the "big run".
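The arithmetic above can be written down as a short sketch so the numbers are easy to swap for your own. The constants are just the rough figures recalled in this answer, and the function name is made up for illustration:

```python
# Rough figures from the answer above; replace with your own measurements.
BAM_GB_PER_SAMPLE = 100   # one aligned BAM at compression level 5
RAW_FACTOR = 2            # doubling to account for raw FASTQ / unaligned BAM


def cohort_storage_tb(n_samples, bam_gb=BAM_GB_PER_SAMPLE, raw_factor=RAW_FACTOR):
    """Return an approximate cohort footprint in decimal terabytes."""
    per_sample_gb = bam_gb * raw_factor
    return n_samples * per_sample_gb / 1000


print(cohort_storage_tb(100))  # 20.0
```

This ignores intermediate files (sorted or deduplicated BAMs, temporary directories), which can add noticeably to the peak usage during a run.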


Thanks for the reply. Do you have any idea of the output size of variant calling with GATK?


Genome Variant Call Format file (gVCF). gVCF was developed to store sequencing information for both variant and non-variant positions, which is required for human clinical applications. gVCF is a set of conventions applied to the standard variant call format (VCF) 4.1 as documented by the 1000 Genomes Project. These conventions allow representation of genotype, annotation, and other information across all sites in the genome in a compact format. Typical human whole-genome sequencing results expressed in gVCF with annotation are less than 1 Gbyte, or about 1/100 the size of the BAM file used for variant calling. If you are performing targeted sequencing, gVCF is also an appropriate choice to represent and compress the results.
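Taking the "about 1/100 the size of the BAM" figure quoted above at face value, a rough cohort-level number could be sketched like this. The 100 GB BAM size is borrowed from the other answer in this thread, the function name is hypothetical, and actual GATK output may well differ from this ratio:

```python
def gvcf_cohort_gb(n_samples, bam_gb=100, bam_to_gvcf_ratio=100):
    """Estimate total gVCF storage (GB), assuming gVCF is ~1/100 of the BAM size."""
    per_sample_gvcf_gb = bam_gb / bam_to_gvcf_ratio
    return n_samples * per_sample_gvcf_gb


print(gvcf_cohort_gb(100))  # 100.0
```

Under these assumptions the variant-calling output is small compared with the alignments and raw reads.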


No, never used it.
