Merging many BAM files [solved]
8.2 years ago
biostart ▴ 370

Hello,

I am merging BAM files from several replicate experiments into one large BAM file. The resulting file is really large (about 400 GB). I was using the "cat" command, and at some point the connection to the server broke, so the end of the BAM file may be damaged (e.g. not all parts of the individual BAM files I was merging made it into the final file). Is that a problem? For the biological analysis it should not matter if a small portion of reads is missing, but will it damage the binary structure of the BAM file in such a way that it becomes unusable?

Thanks!

next-gen ChIP-Seq rna-seq
8.2 years ago

The BAM file itself will be broken and at least some downstream applications will fail. BTW, if the input BAM files were sorted it would have been preferable to use samtools merge.
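
For example, assuming the per-replicate BAMs are already coordinate-sorted (the file names here are placeholders), a minimal sketch would be:

    # merge coordinate-sorted replicate BAMs into a single sorted BAM
    samtools merge merged.bam rep1.bam rep2.bam rep3.bam
    # index the result for downstream tools
    samtools index merged.bam

Unlike plain Unix cat, samtools merge writes a single consistent header and a proper EOF block.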

As an aside, you rarely need to merge samples like this unless you're doing variant calling (and even then it's not needed much any more).

Devon, when you suggest avoiding merging, which type of analysis do you have in mind? I need to calculate a heatmap that is representative of all the replicates (that is, an average over all replicates). Are there tools that take multiple replicate BAM files as input and construct a single heatmap?

What data do you want to put on the heatmap? If it is coverage, you can compute a coverage file per BAM and then add the files together. Using deepTools you can compute the coverages, add them together using bigwigCompare, and finally create a heatmap.
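
A minimal sketch of that deepTools route, assuming two replicates and placeholder file names (bin size and thread count are arbitrary examples):

    # per-replicate coverage tracks (bigWig)
    bamCoverage -b rep1.bam -o rep1.bw --binSize 25 -p 8
    bamCoverage -b rep2.bam -o rep2.bw --binSize 25 -p 8
    # add the two coverage tracks into a single pooled track
    bigwigCompare -b1 rep1.bw -b2 rep2.bw --operation add -o pooled.bw

The pooled bigWig can then go into computeMatrix/plotHeatmap, as in the sketch further down.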

I am trying to use your tool, and it looks very impressive, but also very sophisticated. Historically I am more used to doing this kind of analysis in HOMER. Do you know whether HOMER allows creating a single tag dir from multiple BAM files?

I am still unsure what you want to do. My guess is that it is coverage, but can you elaborate?

I want to calculate enriched heatmaps, like these https://deeptools.readthedocs.org/en/latest/content/tools/plotHeatmap.html

The only problem is that the input files for a single heatmap add up to several hundred GB.

Can you define "calculate enriched heatmaps"?

The method you reference suggests that you supply plotHeatmap with the output of computeMatrix. computeMatrix does not take a BAM file; instead it needs the regions you wish to score as a BED file, and the scores for those regions as a bigWig file.
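
In sketch form, with placeholder file names (a bigWig such as a pooled coverage track serves as the score file, and the window sizes are arbitrary examples):

    # score a set of regions using a bigWig coverage track
    computeMatrix reference-point --referencePoint center \
        -S pooled.bw -R regions.bed -a 2000 -b 2000 -o matrix.gz
    # draw the heatmap from the resulting matrix
    plotHeatmap -m matrix.gz -out heatmap.png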

I am talking about the whole downstream analysis, starting from the BAM files and ending with a heatmap like the one in the figure. Currently I am trying HOMER, but it has the following problem: it worked well for one cell type with about 200 GB of reads as input, but it failed to work for another cell type with twice as many reads.

Is there any reason why you can't process each sample individually, then average scores of replicates?

Two reasons: (1) processing each of the 20 samples individually is more work, and (2) I have no computational tool to average 20 heatmaps into a single one.
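
(A sketch of how (2) could be scripted without a dedicated averaging tool, assuming per-replicate bigWig coverage files with placeholder names already exist, is to accumulate them pairwise with bigwigCompare:)

    # sum the per-replicate coverage tracks two at a time
    cp rep1.bw pooled.bw
    for bw in rep2.bw rep3.bw rep4.bw; do
        bigwigCompare -b1 pooled.bw -b2 "$bw" --operation add -o tmp.bw
        mv tmp.bw pooled.bw
    done

The sum differs from the average only by a constant factor, which does not change the shape of the heatmap.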

I guess I'm a little confused: your post is about merging BAM files, but above you said that you had already run this with HOMER. Can you explain what you mean by "failed to work"? Did the program crash, or did something else happen?

Are you sure you're not running into memory issues with such a large amount of data? Merging the BAM files won't solve that.

I figured out how to create a merged tag dir in HOMER by merging several precalculated HOMER tag dirs, avoiding merging the BAM files directly. The problem, though, is that HOMER works fine only up to a certain directory size. Perhaps it is a memory issue, since intermediate files can be >1 TB. For now, I have just decided to discard several replicates to downsize the problem.
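
(If I recall HOMER correctly, combining precalculated tag directories is done with the -d option of makeTagDirectory; the directory names below are placeholders. makeTagDirectory can also be given several BAM files directly to build one tag directory in a single step.)

    # combine existing per-replicate tag directories into one
    makeTagDirectory merged_tagdir/ -d rep1_tagdir/ rep2_tagdir/ rep3_tagdir/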

deepTools won't have issues with a BAM file that large. Having said that, I would normally keep my replicates in separate files, since otherwise you can't assess variability. If you want to then do some comparison between groups, you can always get the data (as a numpy matrix or sometimes even a text file) and write a script to do whatever you want with it.
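
As far as I recall, computeMatrix can also write a plain tab-delimited copy of the score matrix alongside the gzipped one, which is convenient for custom scripting (file names are placeholders):

    # keep a tab-delimited copy of the per-region scores for your own script
    computeMatrix reference-point -S rep1.bw rep2.bw -R regions.bed \
        -a 2000 -b 2000 -o matrix.gz --outFileNameMatrix matrix.tab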

PS. I have started "samtools merge", and it is VERY slow. I guess it will take the whole day just to merge the files.

FYI, you can multithread the writing with the -@ option.
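
For example (the thread count is an arbitrary choice):

    # use 8 additional threads for compression while writing the merged BAM
    samtools merge -@ 8 merged.bam rep1.bam rep2.bam rep3.bam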

Good advice about -@, thank you!

Almost anything. In fact, the only time you actually need to merge biological replicates is for variant calling, or for ChIP-seq when your sequencing depth is much too low. Oh, and maybe for assembly. So unless your coverage is low, you likely don't need to merge replicates.

8.2 years ago
pld 5.1k

The file will not work.

At the very least, you won't have the EOF marker and most programs reading BAMs will check for that. You'll probably have a partial entry that will make a mess of things as well.

It will be hard to tell how much of the data is missing, outside of comparing the size of the merged file with the total size of the input files. It could be a few reads, it could be a large portion of the reads.
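
To at least confirm whether the file is truncated, samtools has a quick integrity check (file name is a placeholder):

    # verify the presence of the EOF block and a readable header
    samtools quickcheck -v merged.bam && echo "looks intact" || echo "truncated or otherwise broken"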

I would start over completely and launch the job with nohup, which keeps it running even if your shell session ends (i.e. you disconnect).
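
A sketch combining this with the merge itself (file names are placeholders):

    # keep the merge running after the SSH session closes, logging output to a file
    nohup samtools merge merged.bam rep1.bam rep2.bam rep3.bam > merge.log 2>&1 &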

Finally, I'm not sure that you want to use cat for this approach. I would use samtools merge to do this.

If you can, consider downsampling to reduce the size of the input files.
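
If downsampling is acceptable, samtools view can subsample reads by a fraction (the seed and fraction below are arbitrary examples):

    # keep roughly 25% of the reads; the integer part of -s is the random seed
    samtools view -b -s 2.25 -o rep1.downsampled.bam rep1.bam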

8.2 years ago
biostart ▴ 370

So in the end I solved the task by downsizing the dataset (discarding several replicates).
