Question: Subsample BAM to fixed number of alignments

I am attempting to run a comparison across a batch of alignments (BAM format) to assess spread. However, some of my datasets are larger than others, and I wish to subsample down to an equal number of reads per sample.

If I wanted to randomly extract 1 million reads from a BAM file, is there a method to do this?

Note: I am fully aware of the samtools and Picard methods that reduce by a proportion (e.g. the flag -s 0.33), but that results in a reduced proportion _per sample_ rather than the fixed number of reads per sample that I need.

BAM subsampling has been discussed before, but always as proportional data reduction, never the fixed number I require: https://www.biostars.org/p/44527/, https://www.biostars.org/p/76791/#76796, https://www.biostars.org/p/110041, https://broadinstitute.github.io/picard/command-line-overview.html

Edit: I've also come across bamtools, but I haven't been able to get it to work, and it doesn't seem to have been updated in quite some time.

Daniel ♦ 3.7k, 4.6 years ago • updated 4.6 years ago by Alex Reynolds 28k

Is there a reason you cannot use bash, as several have suggested in the posts you link to? For example, https://www.biostars.org/p/44527/#126428 shows how to get 1 million randomly selected reads from your BAM file.

alolex • 890, 4.6 years ago

I was hoping a tool already existed that wouldn't require me to create 500 GB of temporary files only to compress them again.

Daniel ♦ 3.7k, 4.6 years ago

Well, it depends on the downstream application, but you can often get around writing intermediates to disk by using process substitution ...

myprogram -f1 file1.bam -f2 <(samtools view file2.bam | head -1000000) -f3 file3.bam
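# caveat: head -1000000 takes the first million alignments rather than a random subset, and samtools view without -h drops the header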
george.ry ♦ 1.1k, 4.6 years ago

Yeah, I see. I just expected there to be something slick, as I imagine it's a common process. Thanks for the info.

Daniel ♦ 3.7k, 4.6 years ago

reformat.sh from BBMap accepts a samplereadstarget= parameter and can likely handle BAM files. Odds are good that it'll work well, but I've never tried.

Devon Ryan 90k, 4.6 years ago

BTW, if reformat.sh won't do this then I'm tempted to just write a small program that will subsample a fixed number of alignments (or pairs for PE datasets) from SAM/BAM/CRAM files using reservoir sampling, since it's simple enough to do.
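
For single-end data the one-pass version is simple enough to sketch in shell right now (algorithm R in awk, with placeholder filenames; note the reservoir holds all 1M sampled SAM lines in memory, so expect a few GB of RAM):

samtools view -H input.bam > sampled.sam
samtools view input.bam | awk -v k=1000000 '
    BEGIN { srand() }
    { i++ }
    i <= k { r[i] = $0; next }               # fill the reservoir with the first k lines
    { j = int(rand()*i) + 1;                 # pick j uniformly from 1..i
      if (j <= k) r[j] = $0 }                # keep line i with probability k/i
    END { n = (i < k ? i : k); for (x = 1; x <= n; x++) print r[x] }' >> sampled.sam
samtools view -bS sampled.sam -o sampled.bam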

Devon Ryan 90k, 4.6 years ago

I've tried out BBMap, as it purported to do as you said, but I got some Java errors, and given @matted's answer below I've decided to just stick with that, as I'm not Java savvy. I tried all combinations of SAM and BAM for inputs and outputs too. If of interest:

daniel-desktop:~/Raw_Data/ALD/SAMs$ ~/Local/bbmap/reformat.sh -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t
java -ea -Xmx200m -cp /home/sbi6dap/Local/bbmap/current/ jgi.ReformatReads -in ES09.bam -out ES09_1m.sam samplereadstarget=1000000 mappedonly=t
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
Executing jgi.ReformatReads [-in, ES09.bam, -out, ES09_1m.sam, samplereadstarget=1000000, mappedonly=t]

Unknown parameter ES09_1m.sam
Exception in thread "main" java.lang.AssertionError: Unknown parameter ES09_1m.sam
    at jgi.ReformatReads.<init>(ReformatReads.java:168)
    at jgi.ReformatReads.main(ReformatReads.java:45)
Daniel ♦ 3.7k, 4.6 years ago

Perhaps this wouldn't yield the error, since reformat.sh expects key=value pairs (in=/out=) rather than -in/-out flags:

~/Local/bbmap/reformat.sh in=ES09.bam out=ES09_1m.sam samplereadstarget=1000000 mappedonly=t
Devon Ryan 90k, 4.6 years ago

Wow, yes, I'm an idiot. I didn't read it carefully enough because I was trying to rush. It looks like it's running now. At least this makes a good test of this method versus the shuf method below, as they're both running simultaneously.

Rule 1: RTFM

Daniel ♦ 3.7k, 4.6 years ago

I'd just like to add one comment...

This command will work fine, with two caveats. First, reformat.sh will keep paired reads together for FASTQ and FASTA input, but NOT for SAM/BAM input. Second, if you have secondary alignments with asterisks instead of bases, that is exactly how they will come out (unless you set primaryonly=t).

Brian Bushnell 16k, 4.6 years ago

Cool, I'd be interested in hearing the difference in time required.

Edit: Though really this should be I/O bound...

Devon Ryan 90k, 4.6 years ago

I think I'm going to have to walk away from it for today, and unfortunately I didn't time it. I'll do a test tomorrow.

Daniel ♦ 3.7k, 4.6 years ago

Hi Daniel, do you want 1M reads or 1M alignments? When sampling the latter, you might need to control for the NH tag and the secondary alignment flag. Otherwise, you could extract the unique read IDs, shuffle them, select 1M, and filter your BAM file with these IDs.
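
A rough sketch of that ID-based route, assuming Picard's FilterSamReads is available (the picard.jar path and filenames are placeholders; the sort may need substantial temporary disk space):

samtools view input.bam | cut -f1 | sort -u | shuf -n 1000000 > ids.txt
java -jar picard.jar FilterSamReads I=input.bam O=sampled.bam READ_LIST_FILE=ids.txt FILTER=includeReadList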
michael.ante ♦ 3.3k, 4.6 years ago
Answer:

This is based on some of the earlier threads mentioned in the comments, but you can use the GNU shuf utility to get a fixed number of random lines in a file. Recent versions of the tool use reservoir sampling, which will improve efficiency. You could run something like:

shuf -n 1000000 input.sam > output.sam

To get the header correctly, you could do something like:

cat <(samtools view -SH input.sam) <(samtools view -S input.sam | shuf -n 1000000) > output.sam
matted 7.0k, 4.6 years ago

Thanks, this seems a lot less onerous than when I first looked at the other options with the intermediate files. Pipes save the day. I'm still surprised that there isn't a canonical tool for this though.

For completeness, taking BAM inputs, I'm using:

cat <(samtools view -H input.bam) <(samtools view -q 255 input.bam | shuf -n 1000000) > output.sam
Daniel ♦ 3.7k, 4.6 years ago

My guess as to why there's no existing tool to do this (as far as we know) is that random subsampling gives answers that are almost correct in terms of the number of output reads, and is easier to implement.

If you take the sampled number of reads as a binomial random variable, the standard deviation is tiny compared to the number of reads (at most 1000 reads when sampling 1M, which is 0.1% of the total). In fact, the standard deviation as a fraction of the expected number of sampled reads N is at most 1/sqrt(N). The argument might be that a 0.1% variation shouldn't cause problems, or if it does, you may already be in trouble.
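
To spell that out: if the sampled count is Binomial(n, p) with mean N = np, then sd = sqrt(np(1-p)) <= sqrt(np) = sqrt(N), so sd/N <= 1/sqrt(N), which is 0.1% at N = 10^6. And if that wobble is acceptable, a fixed target is approximately reachable with the proportional tools by computing the fraction per sample, something like this (samtools view -s takes seed.fraction, so this assumes the file holds more than 1M alignments):

total=$(samtools view -c input.bam)
frac=$(echo "scale=6; 1000000 / $total" | bc)
samtools view -b -s $frac input.bam > subsampled.bam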

matted 7.0k, 4.6 years ago

reformat.sh will give an exact number of reads (or bases), but at the expense of reading the file twice. For an approximate answer, it only has to read the file once, which is faster.

Brian Bushnell 16k, 4.6 years ago
Answer:

One approach is to:

  • Convert BAM to SAM with samtools
  • Strip the header from the parent SAM file
  • Use sample to sample 1M lines without replacement from the parent, headerless SAM, which uses reservoir sampling of line offsets to get around memory limitations in GNU shuf
  • Reheader the SAM-formatted sample
  • Recompress the reheadered, SAM-formatted sample to a BAM file

In explicit code:

$ samtools view -h -o reads.sam reads.bam
$ grep -v '^@' reads.sam > nohead_reads.sam
$ sample --sample-size 1000000 --sample-without-replacement --preserve-order nohead_reads.sam > nohead_reads_1M_sample.sam
$ samtools reheader nohead_reads_1M_sample.sam reads.bam
$ samtools view -bS nohead_reads_1M_sample.sam -o reads_1M_sample.bam

This process can be easily automated and parallelized.

Alex Reynolds 28k, 3.2 years ago

Just been having a poke around the code for your sample program. Good stuff, and quick too! That's gone straight into my /u/l/bin!

Seeing as I did have the source, though, I did reformat the help text to fit 80 characters ;) You're toying with my sanity!

george.ry ♦ 1.1k, 4.6 years ago

The sample tool looks really cool, but I have a question about the samtools reheader nohead_reads_1M_sample.sam reads.bam step: nohead_reads_1M_sample.sam doesn't have a header, and samtools reheader is supposed to use the input SAM's header to modify the input BAM file. So this line doesn't make sense, right?
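
(If so, one way to sidestep reheader entirely would be to prepend the original header before compressing, something like: cat <(samtools view -H reads.bam) nohead_reads_1M_sample.sam | samtools view -bS - > reads_1M_sample.bam)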

epigene • 450, 3.0 years ago

I have tried the suggestion above with a little modification. My case is sampling 20M reads, with replacement, from each of hundreds of BAM files. sample is the best tool for this.

samtools view -h -o reads.sam reads.bam  # with the good header
sample -k 20000000 -r -p reads.sam > reads_20M.sam
samtools view -bS reads_20M.sam > reads_20M.bam
samtools reheader reads_20M.bam reads.sam | samtools sort -o reads_20M.sorted.bam

In the end I lost some reads, because the sampling also happened over the header/read-group lines. But even with the bad sampled header, reads_20M.sam could be converted to BAM format.

samtools view -c reads_20M.sorted.bam
19999810
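
(The ~190 missing reads are presumably header lines that sample picked in place of alignments; stripping the header first, e.g. grep -v '^@' reads.sam > body.sam, and sampling body.sam instead should avoid that.)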
zhaolin20042008 • 0, 9 months ago
Answer:

I've just written one: https://github.com/lindenb/jvarkit/wiki/Biostar145820

$ samtools view -bu -q 60 input.bam |\
  java -jar dist/biostar145820.jar -n 10 -o out.bam
Pierre Lindenbaum 120k, 4.6 years ago

It looks like tomorrow is going to be a day of testing BAM subsampling methods! Thanks, I'll check it out.

Daniel ♦ 3.7k, 4.6 years ago

Given Brian's mention of the two-pass nature of reformat.sh for extracting exact numbers of reads, I suspect that this Java solution will be the fastest (excluding I/O bottlenecks).

Devon Ryan 90k, 4.6 years ago

I don't think that's right. Glancing through the code, Pierre's solution shuffles (sorts by random indices) the entire BAM (say size N), then takes the first M. So the runtime is O(N*log(N)). The memory usage is trickier to analyze because sorting will go to disk, but it's at least O(M).

Alex's solution (and mine with shuf, assuming a new enough version of shuf) is O(N), with O(M) memory usage by reservoir sampling. And the random sampling methods (with approximate total counts) are O(N) with O(1) memory usage.

matted 7.0k, 4.6 years ago

Indeed, I guess I hadn't looked at the code closely enough.

Devon Ryan 90k, 4.6 years ago

I didn't read the whole thread, but am I wrong: when using reservoir sampling with a small number of reads, does the output file tend towards containing the last reads in the BAM? So if a large BAM contains unmapped reads (at the end) and N=10, for example, will the output contain only unmapped reads?

Pierre Lindenbaum 120k, 4.6 years ago

With a small N and a large number of reads, the probability that rand(1,i) <= N for the ith entry in the file becomes smaller with increasing i, so later reads are not favored; each read ends up in the final sample with equal probability. Actually, this is true for any N.
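
A quick simulation bears this out; a throwaway sketch of algorithm R over n=1000 items with k=10, checking how often the last item lands in the sample (expected: k/n = 1%, the same as any other item):

awk -v k=10 -v n=1000 -v t=10000 'BEGIN { srand()
    for (trial = 0; trial < t; trial++) {
        for (i = 1; i <= n; i++) {
            if (i <= k) r[i] = i; else { j = int(rand()*i) + 1; if (j <= k) r[j] = i }
        }
        for (x = 1; x <= k; x++) if (r[x] == n) hits++
    }
    print hits / t }'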

Devon Ryan 90k, 4.6 years ago
