Biostar Beta. Not for public use.
Subsample BAM to fixed number of alignments
3
Entering edit mode
5.6 years ago
Daniel ♦ 3.7k
@Daniel1123

I am attempting to run a comparison a batch of alignments (bam format) to assess spread. However, some of my datasets are larger than others and I wish to subsample down to an equal number of reads per sample.

If I wanted to randomly extract 1 Million reads from a bam file is there a method to do this?

Note: I am fully aware of the samtools and picard methods which allow you to reduce by a proportion (i.e. the flag -s 0.33) but that would not result in an fixed number of reads per sample, which is what I need, but a reduced proportion _per sample_ which doesn't help.

Bam subsampling has previously been talked about it but always the proportional data reduction, not the fixed number required: https://www.biostars.org/p/44527/, https://www.biostars.org/p/76791/#76796, https://www.biostars.org/p/110041, https://broadinstitute.github.io/picard/command-line-overview.html

Edit: Also, I've come across bamtools but haven't been able to get it to work and seems to have not been updated in quite some time

bam subsampling • 6.7k views
1
Entering edit mode

Is there a reason you cannot use bash like several have commented on in the posts you link to? For example, see https://www.biostars.org/p/44527/#126428 as it shows how to get 1 million randomly selected reads from your bam file.

0
Entering edit mode

I was hoping there would be a tool already existing without needing me to create 500gb of temporary files only to compress them again.

0
Entering edit mode

Well it depends on the downstream application, but you can often get around writing intermediates to disk by redirection ...

myprogram -f1 file1.bam -f2 <(samtools view file2.bam | head -1000000) -f3 file3.bam

0
Entering edit mode

Yeah, I see. I just expected there to be something slick as I imagine it's a common process. Thanks for the info.

0
Entering edit mode

reformat.sh from BBMap can accept a sample= parameter and can likely handle BAM files. Odds are good that that'll work well, but I've never tried.

0
Entering edit mode

BTW, if reformat.sh won't do this then I'm tempted to just write a small program that will subsample a fixed number of alignments (or pairs for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.

0
Entering edit mode

I've tried out BBmap as it purported to do as you said, but got some java errors and given @matted's answer below I've just decided to stick with that as I'm not Java savy. I tried all combinations of sam and bam for ins and outs too. If of interest:

daniel-desktop:~/Raw_Data/ALD/SAMs~/Local/bbmap/reformat.sh -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t java -ea -Xmx200m -cp /home/sbi6dap/Local/bbmap/current/ jgi.ReformatReads -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar Executing jgi.ReformatReads [-in, ES09.bam, -out, ES09_1m.sam, samplereadstarget=1000000, mappedonly=t] Unknown parameter ES09_1m.sam Exception in thread "main" java.lang.AssertionError: Unknown parameter ES09_1m.sam at jgi.ReformatReads.<init>(ReformatReads.java:168) at jgi.ReformatReads.main(ReformatReads.java:45)  ADD REPLYlink 0 Entering edit mode Perhaps this wouldn't yield the error: ~/Local/bbmap/reformat.sh in=ES09.bam out=ES09_1m.sam samplereadstarget=1000000 mappedonly=t  ADD REPLYlink 0 Entering edit mode wow, yes, I'm an idiot. Didn't read it well enough as I was trying to rush it. It looks like it's running now. Good test of this versus the shuf method below at least, as they're both going simultaneously. Rule 1: RTFM ADD REPLYlink 1 Entering edit mode I'd just like to add one comment... This command will work fine, with 2 caveats. Firstly, reformat.sh will keep paired reads together for fastq and fasta input, but NOT for sam/bam input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set primaryonly=t). ADD REPLYlink 0 Entering edit mode Cool, I'd be interested in hearing the difference in time required. Edit : Though really this should be IO bound... ADD REPLYlink 0 Entering edit mode Think I'm going to have to walk away from it today and I didn't put a time count on it unfortunately. Will do a test tomorrow. ADD REPLYlink 0 Entering edit mode Hi Daniel, Do you want to have 1M reads or 1M alignments? While sampling the latter you might need to control for the NH tag and the secondary alignment flag. Otherwise, you could extract the unique read ids, shuffle and select 1M reads and filter your BAM file with these ids. ADD REPLYlink 4 Entering edit mode 5.6 years ago matted 7.0k @matted5333 This is based on some of the earlier threads mentioned in the comments, but you can use the GNU shuf utility to get a fixed number of random lines in a file. Recent versions of the tool use reservoir sampling, which will improve efficiency. You could run something like: shuf -n 1000000 input.sam > output.sam  To get the header correctly, you could do something like: cat <(samtools view -SH input.sam) <(samtools view -S input.sam | shuf -n 1000000) > output.sam  ADD COMMENTlink 1 Entering edit mode Thanks, this seems a lot less onerous than when I first looked at the other options with the intermediate files. Pipes save the day. I'm still surprised that there isn't a canonical tool for this though. For completeness, taking BAM inputs, I'm using: cat <(samtools view -H input.bam) <(samtools view -q 255 input.bam | shuf -n 1000000) > output.sam  ADD REPLYlink 0 Entering edit mode My guess as to why there's no existing tools to do this (as far as we know) is that random subsampling gives answers that are almost correct, in terms of the number of output reads, and is easier to implement. If you take the sampled number of reads as a binomial random variable, the standard deviation is pretty tiny compared to the number of reads (at most 1000 reads when sampling 1M, which is 0.1% of the total). In fact, the standard deviation as a fraction of the expected number of sampled reads N is at most 1/sqrt(N). The argument might be that that 0.1% variation shouldn't cause problems, or if it does, you may already be in trouble. ADD REPLYlink 0 Entering edit mode reformat.sh will give an exact number of reads (or bases), but at the expense of reading the file twice. For an approximate answer, it only has to read the file once, which is faster. ADD REPLYlink 3 Entering edit mode 4.2 years ago @Alex Reynolds20 One approach is to: • Convert BAM to SAM with samtools • Strip the header from the parent SAM file • Use sample to sample 1M lines without replacement from the parent, headerless SAM, which uses reservoir sampling of line offsets to get around memory limitations in GNU shuf • Reheader the SAM-formatted sample • Recompress the reheadered, SAM-formatted sample to a BAM file In explicit code:  samtools view -h -o reads.sam reads.bam
$grep -v '^@' reads.sam > nohead_reads.sam$ sample --sample-size 1000000 --sample-without-replacement --preserve-order nohead_reads.sam > nohead_reads_1M_sample.sam
$samtools reheader nohead_reads_1M_sample.sam reads.bam$ samtools view -bS nohead_reads_1M_sample.sam -o reads_1M_sample.bam


This process can be easily automated and parallelized.

0
Entering edit mode

Just been having a poke around the code for your sample program. Good stuff, and quick too! That's gone straight into my /u/l/bin!

Seeing as I did have the source, though, I did reformat the help text to fit 80 characters ;) You're toying with my sanity!

0
Entering edit mode

sample tool looks really cool. but i have a question on samtools reheader nohead_reads_1M_sample.sam reads.bam step, nohead_reads_1M_sample.sam doesn't have header and samtool reheader is supposed to use input sam header to modify input bam file. So this line doesn't make sense, right?

0
Entering edit mode

I have tried the suggestion above with a little modification. My case is to sample 20 M reads each file on hundreds of bam files with replacement . The "sample" is the best tool.

samtools view -o –h reads.sam reads.bam #with the good header


Finally I lost some reads, because the sampling also happened at the readgroup. But with the bad sampling header, the reads_20M.sam could be coverted to the bam format.

samtools view -c reads_20M.sorted.bam
19999810

2
Entering edit mode
5.6 years ago
@Pierre Lindenbaum30

I've just written one: https://github.com/lindenb/jvarkit/wiki/Biostar145820

\$ samtools view -bu -q 60 input.bam |\

java -jar dist/biostar145820.jar -n 10  -o out.bam

0
Entering edit mode

It looks like tomorrow is going to be a day of testing bam subsampling methods! Thanks, I'll check it out.

0
Entering edit mode

Given Brian's mention of the two-pass nature of reformat.sh for extracting exact numbers of reads, I suspect that this java solution will be the fastest (excluding IO bottlenecks).

1
Entering edit mode

I don't think that's right. Glancing through the code, Pierre's solution shuffles (sorts by random indices) the entire BAM (say size N), then takes the first M. So the runtime is O(N*log(N)). The memory usage is trickier to analyze because sorting will go to disk, but it's at least O(M).

Alex's solution (and mine with shuf, assuming a new enough version of shuf) is O(N), with O(M) memory usage by reservoir sampling. And the random sampling methods (with approximate total counts) are O(N) with O(1) memory usage.

0
Entering edit mode

Indeed, I guess I hadn't looked at the code closely enough.

0
Entering edit mode

I didn't read the whole thread, but, am i wrong: when using a reservoir sampling with a small number of reads, the output file tends towards containing the last reads in the BAM ? So, if a large bam contains unmapped reads (at the end) and N=10, for example, the output will only contains unmapped reads. ?

1
Entering edit mode

With a small N and large number of reads, the probability that rand(1,i)<=N, for the ith entry in the file, will become smaller with increasing i. Actually, this is true for any N.