I am attempting to run a comparison a batch of alignments (bam format) to assess spread. However, some of my datasets are larger than others and I wish to subsample down to an equal number of reads per sample.
If I wanted to randomly extract 1 Million reads from a bam file is there a method to do this?
Note: I am fully aware of the samtools and picard methods which allow you to reduce by a proportion (i.e. the flag -s 0.33) but that would not result in an fixed number of reads per sample, which is what I need, but a reduced proportion _per sample_ which doesn't help.
Bam subsampling has previously been talked about it but always the proportional data reduction, not the fixed number required: https://www.biostars.org/p/44527/, https://www.biostars.org/p/76791/#76796, https://www.biostars.org/p/110041, https://broadinstitute.github.io/picard/command-line-overview.html
Edit: Also, I've come across bamtools but haven't been able to get it to work and seems to have not been updated in quite some time
Is there a reason you cannot use bash like several have commented on in the posts you link to? For example, see https://www.biostars.org/p/44527/#126428 as it shows how to get 1 million randomly selected reads from your bam file.
I was hoping there would be a tool already existing without needing me to create 500gb of temporary files only to compress them again.
Well it depends on the downstream application, but you can often get around writing intermediates to disk by redirection ...
Yeah, I see. I just expected there to be something slick as I imagine it's a common process. Thanks for the info.
reformat.sh
from BBMap can accept asample=
parameter and can likely handle BAM files. Odds are good that that'll work well, but I've never tried.BTW, if reformat.sh won't do this then I'm tempted to just write a small program that will subsample a fixed number of alignments (or pairs for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.
I've tried out BBmap as it purported to do as you said, but got some java errors and given @matted's answer below I've just decided to stick with that as I'm not Java savy. I tried all combinations of sam and bam for ins and outs too. If of interest:
Perhaps this wouldn't yield the error:
wow, yes, I'm an idiot. Didn't read it well enough as I was trying to rush it. It looks like it's running now. Good test of this versus the shuf method below at least, as they're both going simultaneously.
Rule 1: RTFM
I'd just like to add one comment...
This command will work fine, with 2 caveats. Firstly, reformat.sh will keep paired reads together for fastq and fasta input, but NOT for sam/bam input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set primaryonly=t).
Cool, I'd be interested in hearing the difference in time required.
Edit : Though really this should be IO bound...
Think I'm going to have to walk away from it today and I didn't put a time count on it unfortunately. Will do a test tomorrow.