Subsample BAM to fixed number of alignments
5.6 years ago
Daniel ♦ 3.7k
@Daniel1123

I am attempting to run a comparison across a batch of alignments (BAM format) to assess spread. However, some of my datasets are larger than others, and I wish to subsample each down to an equal number of reads per sample.

If I wanted to randomly extract 1 million reads from a BAM file, is there a method to do this?

Note: I am fully aware of the samtools and Picard methods which let you reduce by a proportion (e.g. the flag -s 0.33), but that would not give a fixed number of reads per sample, which is what I need; it gives a reduced proportion _per sample_, which doesn't help.

BAM subsampling has been discussed before, but always as proportional data reduction, never the fixed number required: https://www.biostars.org/p/44527/, https://www.biostars.org/p/76791/#76796, https://www.biostars.org/p/110041, https://broadinstitute.github.io/picard/command-line-overview.html

Edit: I've also come across bamtools, but I haven't been able to get it to work, and it seems not to have been updated in quite some time.

bam subsampling

Is there a reason you cannot use bash, as several have suggested in the posts you link to? For example, see https://www.biostars.org/p/44527/#126428, which shows how to get 1 million randomly selected reads from your BAM file.


I was hoping a tool already existed so that I wouldn't need to create 500 GB of temporary files only to compress them again.


Well, it depends on the downstream application, but you can often avoid writing intermediates to disk by using process substitution:

myprogram -f1 file1.bam -f2 <(samtools view file2.bam | head -1000000) -f3 file3.bam

Yeah, I see. I just expected there to be something slick, as I imagine it's a common task. Thanks for the info.


reformat.sh from BBMap can accept a sample= parameter and can likely handle BAM files. Odds are good that that'll work well, but I've never tried.


BTW, if reformat.sh won't do this, then I'm tempted to just write a small program that subsamples a fixed number of alignments (or pairs, for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.
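For what it's worth, the reservoir-sampling core of such a program is only a few lines. A minimal sketch in Python, operating on generic stream items rather than real SAM/BAM records (the function name and structure are illustrative, not any particular tool's code):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For a real BAM you would stream the output of `samtools view` line by line and write the header separately, so only k alignments are ever held in memory.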


I've tried out BBMap as it purported to do what you said, but I got some Java errors, and given @matted's answer below I've decided to stick with that, as I'm not Java savvy. I tried all combinations of SAM and BAM for inputs and outputs too. If of interest:

daniel-desktop:~/Raw_Data/ALD/SAMs$ ~/Local/bbmap/reformat.sh -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t
java -ea -Xmx200m -cp /home/sbi6dap/Local/bbmap/current/ jgi.ReformatReads -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
Executing jgi.ReformatReads [-in, ES09.bam, -out, ES09_1m.sam, samplereadstarget=1000000, mappedonly=t]

Unknown parameter ES09_1m.sam
Exception in thread "main" java.lang.AssertionError: Unknown parameter ES09_1m.sam
    at jgi.ReformatReads.<init>(ReformatReads.java:168)
    at jgi.ReformatReads.main(ReformatReads.java:45)

Perhaps this wouldn't yield the error, since reformat.sh expects key=value parameters rather than dashed flags:

~/Local/bbmap/reformat.sh in=ES09.bam out=ES09_1m.sam samplereadstarget=1000000 mappedonly=t

Wow, yes, I'm an idiot. I didn't read it carefully enough because I was rushing. It looks like it's running now. A good test of this versus the shuf method below, at least, as they're both running simultaneously.

Rule 1: RTFM


I'd just like to add one comment...

This command will work fine, with two caveats. First, reformat.sh will keep paired reads together for FASTQ and FASTA input, but NOT for SAM/BAM input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set primaryonly=t).


Cool, I'd be interested in hearing the difference in time required.

Edit: Though really this should be IO-bound...


I think I'm going to have to walk away from it today, and unfortunately I didn't put a time count on it. I'll do a test tomorrow.

Hi Daniel, do you want to have 1M reads or 1M alignments? When sampling the latter, you might need to control for the NH tag and the secondary-alignment flag. Otherwise, you could extract the unique read IDs, shuffle, select 1M of them, and filter your BAM file with these IDs.
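Sketching that read-ID approach in Python on plain SAM text (standard library only; `sample_by_read_id` is a hypothetical helper, not an existing tool):

```python
import random

def sample_by_read_id(sam_lines, n, seed=0):
    """Choose n read names at random, then keep every alignment belonging to
    those reads, so mates and secondary alignments stay together."""
    names = {line.split('\t', 1)[0]
             for line in sam_lines if not line.startswith('@')}
    chosen = set(random.Random(seed).sample(sorted(names), n))
    return [line for line in sam_lines
            if line.startswith('@') or line.split('\t', 1)[0] in chosen]
```

This keeps exactly n distinct reads, though the number of output alignments varies with how many secondary alignments each chosen read has.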
5.6 years ago
matted 7.0k
@matted5333

This is based on some of the earlier threads mentioned in the comments, but you can use the GNU shuf utility to get a fixed number of random lines from a file. Recent versions of the tool use reservoir sampling, which improves efficiency. You could run something like:

shuf -n 1000000 input.sam > output.sam

To get the header correctly, you could do something like:

cat <(samtools view -SH input.sam) <(samtools view -S input.sam | shuf -n 1000000) > output.sam

Thanks, this seems a lot less onerous than the other options with intermediate files that I first looked at. Pipes save the day. I'm still surprised that there isn't a canonical tool for this, though.

For completeness, taking BAM inputs, I'm using:

cat <(samtools view -H input.bam) <(samtools view -q 255 input.bam | shuf -n 1000000) > output.sam

My guess as to why there are no existing tools to do this (as far as we know) is that random subsampling gives answers that are almost correct in terms of the number of output reads, and is easier to implement.

If you take the sampled number of reads as a binomial random variable, the standard deviation is tiny compared to the number of reads (at most about 1,000 reads when sampling 1M, which is 0.1% of the total). In fact, the standard deviation as a fraction of the expected number of sampled reads N is at most 1/sqrt(N). The argument might be that 0.1% variation shouldn't cause problems, and if it does, you may already be in trouble.
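To put numbers on this (with a hypothetical 10M-read BAM; the binomial model is as above):

```python
import math

def binomial_sd(total_reads, p):
    """Standard deviation of the number of reads kept when each read is
    sampled independently with probability p."""
    return math.sqrt(total_reads * p * (1 - p))

total = 10_000_000            # hypothetical BAM size
p = 1_000_000 / total         # aim for ~1M sampled reads
sd = binomial_sd(total, p)    # about 949 reads
relative = sd / (total * p)   # under 0.1%, consistent with the 1/sqrt(N) bound
```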


reformat.sh will give an exact number of reads (or bases), but at the expense of reading the file twice. For an approximate answer, it only has to read the file once, which is faster.
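The two-pass idea is generic and easy to sketch (an illustration of the approach, not reformat.sh's actual implementation): pass one counts the records, then k random indices are drawn, and pass two keeps exactly those.

```python
import random

def two_pass_sample(make_stream, k, seed=0):
    """Exact-count sampling in two passes over the data.
    make_stream is a zero-argument callable returning a fresh iterator,
    standing in for re-reading the file from disk."""
    n = sum(1 for _ in make_stream())                   # pass 1: count records
    chosen = set(random.Random(seed).sample(range(n), k))
    return [rec for i, rec in enumerate(make_stream())  # pass 2: keep chosen
            if i in chosen]
```

Memory stays O(k) and the output count is exactly k, at the cost of the second read.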

4.2 years ago
@Alex Reynolds20

One approach is to:

  • Convert BAM to SAM with samtools
  • Strip the header from the parent SAM file
  • Use sample to draw 1M lines without replacement from the parent, headerless SAM; sample uses reservoir sampling of line offsets to get around memory limitations in GNU shuf
  • Reheader the SAM-formatted sample
  • Recompress the reheadered, SAM-formatted sample to a BAM file

In explicit code:

$ samtools view -h -o reads.sam reads.bam
$ grep -v '^@' reads.sam > nohead_reads.sam
$ sample --sample-size 1000000 --sample-without-replacement --preserve-order nohead_reads.sam > nohead_reads_1M_sample.sam
$ samtools reheader nohead_reads_1M_sample.sam reads.bam
$ samtools view -bS nohead_reads_1M_sample.sam -o reads_1M_sample.bam

This process can be easily automated and parallelized.


I've just been having a poke around the code of your sample program. Good stuff, and quick too! It's gone straight into my /u/l/bin!

Seeing as I had the source, though, I did reformat the help text to fit 80 characters ;) You're toying with my sanity!


The sample tool looks really cool, but I have a question about the samtools reheader nohead_reads_1M_sample.sam reads.bam step: nohead_reads_1M_sample.sam doesn't have a header, and samtools reheader is supposed to use the input SAM's header to modify the input BAM file. So this line doesn't make sense, right?


I have tried the suggestion above with a little modification. My case is to sample 20M reads with replacement from each of hundreds of BAM files. sample is the best tool for this.

samtools view -h -o reads.sam reads.bam    # with the good header
sample -k 20000000 -r -p reads.sam > reads_20M.sam
samtools view -bS reads_20M.sam > reads_20M.bam
samtools reheader reads.sam reads_20M.bam | samtools sort -o reads_20M.sorted.bam

In the end I lost some reads, because the sampling also happened at the read-group (header) lines. But even with the bad sampled header, reads_20M.sam could be converted to BAM format.

samtools view -c reads_20M.sorted.bam
19999810
5.6 years ago
@Pierre Lindenbaum30

I've just written one: https://github.com/lindenb/jvarkit/wiki/Biostar145820

$ samtools view -bu -q 60 input.bam |\
    java -jar dist/biostar145820.jar -n 10 -o out.bam

It looks like tomorrow is going to be a day of testing bam subsampling methods! Thanks, I'll check it out.


Given Brian's mention of the two-pass nature of reformat.sh for extracting an exact number of reads, I suspect that this Java solution will be the fastest (excluding IO bottlenecks).


I don't think that's right. Glancing through the code, Pierre's solution shuffles (sorts by random indices) the entire BAM (say size N), then takes the first M. So the runtime is O(N*log(N)). The memory usage is trickier to analyze because sorting will go to disk, but it's at least O(M).

Alex's solution (and mine with shuf, assuming a new enough version of shuf) is O(N), with O(M) memory usage by reservoir sampling. And the random sampling methods (with approximate total counts) are O(N) with O(1) memory usage.


Indeed, I guess I hadn't looked at the code closely enough.


I didn't read the whole thread, but am I wrong: when using reservoir sampling with a small number of reads, does the output file tend towards containing the last reads in the BAM? So, if a large BAM contains unmapped reads (at the end) and N=10, for example, will the output contain only unmapped reads?


With a small N and a large number of reads, the probability that rand(1,i) <= N for the i-th entry in the file becomes smaller with increasing i, so later entries are not over-represented. Actually, this is true for any N.
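A quick simulation confirms that position in the file doesn't matter: with a toy reservoir sampler (algorithm R, written here just for the check), every item, including the last ones, ends up in the sample close to N/total of the time.

```python
import random
from collections import Counter

def reservoir(items, k, rng):
    """Algorithm R: uniform sample of k items from a stream."""
    res = []
    for i, x in enumerate(items):
        if i < k:
            res.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                res[j] = x
    return res

rng = random.Random(42)
trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir(range(20), 5, rng))
# Each of the 20 positions is kept ~25% of the time (5/20), first and last alike.
```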
