Question

Forum:Archiving BAM files and analysis

1

8.8 years ago

seidel 11k

I'm curious what conventions people adopt for archiving their working analysis, especially when someone leaves the organization. I'm not talking about publication; there are solutions for that. I'm talking about the equivalent of a lab notebook. Most labs keep their scientific notebooks as references for future projects, protocols, etc. But for computational types, analysts with directories in the file system, their "notebooks" are typically their directories. Sometimes those directories are full of large files. BAM files, in particular, usually contain the command line that was used to generate the file, which can be viewed with samtools, e.g.:

samtools view -H accepted_hits.bam

[...]
@PG  ID:TopHat  VN:2.0.10  CL:/n/local/bin/tophat --GTF mm10.Ens_73.cuff.gtf -p 3 -g 1 -o s_7_1_AGTCAA.tophat bowtie-index/mm10 C337LACXXa/s_7_1_AGTCAA.fastq.gz

Once a researcher leaves, an organization has to decide what to do with their directories. Scientific notebooks are essentially kept forever. For analysts, I don't see why their directories shouldn't be treated the same way, except that keeping large files may not be necessary if they have served their purpose and the commands that generated them are known. Thus for archiving purposes, if the directory would otherwise just be removed, is there any reason not to simply replace a BAM file with the command used to generate it, i.e. accepted_hits.bam -> accepted_hits.bam.cl (assuming it is not actively being used in any projects)?
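The replacement idea above could be sketched roughly as follows. This is a minimal Python sketch, not an established tool: the function names `parse_pg_commands` and `write_command_sidecar` are hypothetical, it assumes `samtools` is on the PATH, and it assumes the program that wrote the BAM actually recorded a CL field in its @PG line (not every tool does). It only writes the `.bam.cl` sidecar and leaves deleting the BAM as a separate, deliberate step.

```python
import subprocess
from pathlib import Path


def parse_pg_commands(header_text):
    """Return one dict per @PG header line, keyed by SAM tag (ID, VN, CL, ...)."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        fields = {}
        # SAM header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            tag, _, value = field.partition(":")
            fields[tag] = value
        records.append(fields)
    return records


def write_command_sidecar(bam_path):
    """Save the @PG lines of a BAM as <name>.bam.cl; the BAM itself is untouched."""
    header = subprocess.run(
        ["samtools", "view", "-H", str(bam_path)],
        check=True, capture_output=True, text=True,
    ).stdout
    pg_lines = [line for line in header.splitlines() if line.startswith("@PG")]
    sidecar = Path(str(bam_path) + ".cl")
    sidecar.write_text("\n".join(pg_lines) + "\n")
    return sidecar
```

Keeping the whole @PG line rather than only the CL value also preserves the program name and version, which matter when the file has to be regenerated later.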

What do people do with /home/user or other directories for people that leave? Are there guidelines in place in your organization for dealing with large files that are not primary data?

archive BAM • 2.4k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 8.8 years ago by seidel 11k

0

You can generate FASTQ from BAM, but not the reverse unless you re-run the alignment. To do that, you should also archive the software (with dependencies) and the reference sequence.

ADD REPLY • link 8.8 years ago by Asaf 10k

Ram · Answer 1 · 2015-06-24

I use a program called shlog (shell log) which you append to every command you want to log, so:

samtools index file.bam

would just become

log samtools index file.bam

It's just a basic Python script which sub-executes your original command, but also logs a bunch of stuff before/after it runs - time, user, hostname, all files used in the command, files modified (based on MD5), and all files added to or deleted from the current and named folders - stuff like that.
The 'execution event' gets a unique ID (like an MD5) itself, and all the above data is sent to the cloud and added to a graph database. Other peeps can then MD5 any file ever shlog'd (even if they don't have shlog themselves), paste the MD5 into the shlog website, and see a full history of how that file came to be, right back to the fastq and on to any future analysis/graphs made from the .bam.
It also backs up any created/modified/deleted/used files to a local backup folder (preventing duplicates based on MD5 and only up to a certain filesize), so the exact command can be re-run even if it used version 0.2.5a of script.pl you wrote 5 years earlier.
It can e-mail you when execution is done, or even call your phone if you have a Twilio account. It was pretty rad seeing as I wrote it when I had just started learning Python.
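For readers who want the flavour of this without the cloud side, here is a minimal local sketch of the same idea. It is not the shlog code: `run_logged`, `snapshot`, `md5_of` and the `commands.jsonl` log file are all hypothetical names, and it only hashes files that are named directly on the command line, so it will miss outputs a tool writes elsewhere.

```python
import getpass
import hashlib
import json
import os
import socket
import subprocess
import sys
import time


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(args):
    """Map each argument that names an existing file to its MD5."""
    return {arg: md5_of(arg) for arg in args if os.path.isfile(arg)}


def run_logged(command, log_path="commands.jsonl"):
    """Run `command` (a list of strings) and append one JSON record to log_path."""
    before = snapshot(command)
    started = time.time()
    result = subprocess.run(command)
    after = snapshot(command)
    record = {
        "command": command,
        "user": getpass.getuser(),
        "host": socket.gethostname(),
        "cwd": os.getcwd(),
        "started": started,
        "seconds": time.time() - started,
        "exit_code": result.returncode,
        "inputs": before,
        # Files that were created, or whose MD5 changed, while the command ran.
        "modified": {p: h for p, h in after.items() if before.get(p) != h},
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

Calling `run_logged(["samtools", "index", "file.bam"])` would then play the role of `log samtools index file.bam`, with the record appended to a local file instead of being sent to a graph database.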

It never really took off though. People are actually quite opposed to keeping their records in the cloud, even if not publicly available.
Shame because transparency is kind of neat.

score 1 · Answer 2 · 2015-06-24

1

8.8 years ago

Pierre Lindenbaum 161k

We work with Makefiles. We always save the Makefile that was used to generate the data.

ADD COMMENT • link 8.8 years ago by Pierre Lindenbaum 161k

0

Same here, Makefiles in a GitHub repository. There is a separate folder per project. We also add final results to the GitHub repository (if not too large).

ADD REPLY • link 8.8 years ago by Istvan Albert 100k

0

Yes, I use Makefile quite a bit, and am happy with it as a recipe for generating or re-generating analysis files as needed.

ADD REPLY • link 8.8 years ago by seidel 11k

Ram · Answer 3 · 2015-06-25

0

8.8 years ago

DG 7.3k

Basically it boils down to keeping the essential data files (raw FastQ for instance) and all of the scripts/commands and associated things necessary for re-doing that analysis (reference files, annotation files, databases, etc). For everything else it is a trade-off between how much space it takes to archive, and what you can spare, versus how much effort and time it will take to regenerate the various analyzed files and derived data. It may very well be that you want to keep some of the downstream files (say the final BAM file after all post-processing, or the annotated VCF file, or whatever) but not all of the intermediate files or derived analytic files that wouldn't take long to recreate.
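To put numbers on the space side of that trade-off, something like the following sketch can help. It is only an illustration: `size_by_suffix` and `report` are hypothetical helper names, and grouping by the last extension is crude (a `.fastq.gz` file is counted under `.gz`).

```python
import os
from collections import defaultdict


def size_by_suffix(root):
    """Total bytes under `root`, grouped by lower-cased file extension."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so linked data is not counted here
            suffix = os.path.splitext(name)[1].lower() or "(none)"
            totals[suffix] += os.path.getsize(path)
    return dict(totals)


def report(root):
    """Print extensions from largest to smallest total size."""
    ranked = sorted(size_by_suffix(root).items(), key=lambda item: -item[1])
    for suffix, total in ranked:
        print(f"{total / 1e9:10.2f} GB  {suffix}")
```

Running `report` on a departing user's directory shows which file types dominate the space, which is where the keep-or-regenerate decision matters most.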

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by DG 7.3k

0

Yes, fastq files are considered primary data from the organization's point of view, and are certainly archived. I agree with the other things you mention, but it implies that a user directory is really like a notebook: kept around after they leave, and "curated" into an archival state by someone making choices about which files to keep and which can be re-generated. I'm surprised no one has actually commented on what their organization does for departed users.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by seidel 11k

1

I got an e-mail last week from my boss of 2.5 years ago, asking if I still had the raw sequencing data for an experiment because he lost it.

That was pretty much our policy.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by John 13k

0

Everywhere is different. It varies not only between institutions but between labs. While most labs are used to keeping lab notebooks, and know what to do with cell lines and the like, people's home directories and data were typically not routinely archived or kept. Bioinformatics-heavy labs typically have their own policies in place. In my PhD we didn't have a very formal policy with regard to our electronic data, even among bioinformaticians. Of course for all of our published work the software was published and available, and we mostly worked on public datasets. So everything was reproducible.

For my Post-Doc we were generating lots of NGS data. So again, we archived FastQ files, I archived most completed downstream analysis files, and all of my software and tools are stored somewhere. Now I am Faculty in a Clinical lab setting, and we have very different requirements in terms of what data must be kept and for how long.

I am guessing in most labs it is largely the responsibility of the computational person (student, post-doc, etc) to properly archive their work following whatever the lab has laid out. But yes, your working directory should, in many ways, be treated like a lab notebook and properly retained when you leave the lab.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by DG 7.3k