Question

Data Management for NGS methylation dataset

0

8.2 years ago

Shicheng Guo ★ 9.4k

Dear All,

With the rise of NGS methylation data such as WGBS and RRBS, we can collect more and more data for a given tissue or disease from the SRA and GEO databases. But how should we manage these data?

Since the files are usually very large, I can't keep them on the hard disk for a long time. Currently, I align the reads and build wig files to upload to UCSC.

However, there is one critical problem with UCSC visualization: when we have only a few wig files covering cases and controls, it is very easy to show the regions that differ. But when the sample size or the number of files grows large, how can it be used? Suppose we have 200 wigs (100 case and 100 control).

Any better ideas on how to collect more NGS methylation data and re-analyse it day by day as the number of samples increases?

Best regards

DNA-methylation NGS • 1.6k views

updated 21 months ago by Ram 43k • written 8.2 years ago by Shicheng Guo ★ 9.4k

1

Since the read depth required for WGBS is significant, the FASTQ/BAM files, as you say, are really large (100s of GB).
However, the processed data (% methylation and total number of calls for each CpG) is quite small, and readily indexable in any of the binary formats that exist. I would keep those on your local hard drive, but dump the raw data to magnetic tape, Blu-ray disks, or some cheap 4 TB hard drives RAID'd (1, 4 or 5) together. What I think bioinformaticians should remember when it comes to long-term storage is that in 10 years it will most likely be cheaper to re-sequence the tissue (and generate up-to-date, highest-quality data) than to store raw data for 10 years and then re-analyse it. The only reasons you really need to keep raw data are:
1) The biological sample is unique and irreplaceable.
2) Legal requirements.
3) There is a good possibility the sample was analysed incorrectly.
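
To give a feel for how small the processed per-CpG data is, here is a minimal Python sketch. The record layout (`cpg_dtype`), the toy values, and the figure of roughly 28 million CpGs in the human genome are my own assumptions for illustration, not a standard format; NumPy's compressed `.npz` simply stands in for "any binary format".

```python
import io
import numpy as np

# Hypothetical per-CpG record: position plus the two call counts.
cpg_dtype = np.dtype([
    ("pos", np.uint32),         # coordinate of the CpG on its chromosome
    ("methylated", np.uint16),  # reads calling the CpG methylated
    ("total", np.uint16),       # all reads covering the CpG
])

def percent_methylation(records):
    """Return % methylation per CpG, NaN where there is no coverage."""
    total = records["total"].astype(np.float64)
    meth = records["methylated"].astype(np.float64)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total > 0, 100.0 * meth / total, np.nan)

# Toy data for four CpGs (made-up values).
records = np.array(
    [(10468, 9, 10), (10470, 0, 12), (10483, 5, 20), (10488, 0, 0)],
    dtype=cpg_dtype,
)
pct = percent_methylation(records)

# Round-trip through a compressed binary container (in memory here;
# a file path works the same way).
buf = io.BytesIO()
np.savez_compressed(buf, chr1=records)
buf.seek(0)
restored = np.load(buf)["chr1"]

# Back-of-envelope size per sample, before compression.
N_CPG_APPROX = 28_000_000  # assumed, approximate CpG count for human
bytes_per_sample = N_CPG_APPROX * cpg_dtype.itemsize
print(f"~{bytes_per_sample / 1e6:.0f} MB per sample uncompressed")
```

Under these assumptions a sample is a few hundred MB at most before compression, against 100s of GB for the raw reads, so even 200 samples fit on an ordinary disk.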

Regarding visualization of 200 tracks: if each track were 10 px high (which is tiny), the stack would be 200 × 10 = 2,000 px tall, so you would need a monitor of at least 4K Ultra HD resolution (2,160 px of vertical space) to display it.

If you wanted 30px per track, you would need an IMAX cinema... so obviously something is wrong with this approach ;)

Far better to think of a way to merge your datasets and visualize the distribution of them all at once. For example, calculate the median, the median − 2 standard deviations, and the median + 2 standard deviations for each bp/bin of the genome, and then plot just three tracks (ideally overlaid).
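
The three-track summary could be sketched like this in Python. It assumes you already have a samples × bins matrix of % methylation (NaN where a sample has no coverage); the function names, the toy matrix, and the clipping to 0-100% are my own choices for illustration, not part of any existing tool.

```python
import numpy as np

def summarize_tracks(meth):
    """Collapse many samples into three tracks.

    meth: 2-D array of shape (n_samples, n_bins) holding % methylation,
          with NaN where a sample has no coverage in a bin.
    Returns (lower, median, upper), where lower/upper are median -/+ 2 SD,
    clipped to the valid 0-100% range (a design choice, not a requirement).
    """
    median = np.nanmedian(meth, axis=0)
    sd = np.nanstd(meth, axis=0, ddof=1)  # sample SD across samples
    lower = np.clip(median - 2 * sd, 0, 100)
    upper = np.clip(median + 2 * sd, 0, 100)
    return lower, median, upper

def to_bedgraph_lines(chrom, bin_size, values):
    """Format one track as bedGraph lines (0-based, half-open bins)."""
    lines = []
    for i, value in enumerate(values):
        if np.isnan(value):
            continue  # leave uncovered bins out of the track
        start = i * bin_size
        lines.append(f"{chrom}\t{start}\t{start + bin_size}\t{value:.2f}")
    return lines

# Toy matrix: 3 samples x 3 bins (made-up values).
meth = np.array([
    [10.0, 50.0, 90.0],
    [20.0, 50.0, 100.0],
    [30.0, 50.0, np.nan],
])
lower, median, upper = summarize_tracks(meth)
median_track = to_bedgraph_lines("chr1", 100, median)
print("\n".join(median_track))
```

For a case/control design you could call `summarize_tracks` once per group, which gives six tracks instead of 200 and still shows where the two distributions separate.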

8.2 years ago by John 13k