Downsampling BAM Files
13.4 years ago
Allpowerde ★ 1.3k

Hi, I have insanely high coverage in my BAM files (thanks to only 12-way multiplexing of a tiny genomic region). Since some programs can't handle such extreme coverage, I would like to downsample, that is, randomly pick, say, 1000 aligned reads for each position (e.g. with bamtools random: "Select random alignments from existing BAM file(s)"), but I need to keep the mates intact. Another way of putting this: I want to achieve a maximal read depth of 1000 at each position.
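
For illustration, this is roughly the bamtools call I mean (file names are placeholders); as far as I can tell it draws individual alignments, so mates are not guaranteed to stay together:

    # draw 1000000 alignments at random; mate pairs may end up broken apart
    bamtools random -in input.bam -out random.bam -n 1000000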

Does anyone have experience with downsampling? Are there any pitfalls I might not be aware of?


Well, it is a targeted resequencing experiment with multiple overlapping PCR fragments, so I have spikes of coverage going beyond 100,000x next to areas that have hardly any reads at all. Conceptually, I'm after the functionality GATK offers with -dcov, but I want the result written out as a BAM file.
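
For context, a -dcov call looks roughly like this (the walker, reference, and file names here are placeholders, not my actual pipeline):

    # caps the depth GATK uses at each locus to 1000 reads, but the downsampled
    # reads only exist inside the walker; they are not written back out as a BAM
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R reference.fa -I input.bam -dcov 1000 -o calls.vcf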


3 answers, none chosen as correct or even upvoted! I'm only answering for the karma ;)


When you say "insane", can you quantify the range please?

12.8 years ago
lh3 ★ 33k

Downsampling support has been added to "samtools view" in SVN. It always keeps both ends of a pair using a very simple strategy (<10 lines of additional source code). Credit to Hideo Imamura at Sanger.


Has anybody tried this? I can't find any documentation of this functionality in samtools view. Thanks!


For archiving's sake: the feature is called 'subsampling' and is invoked with the -s switch (code); see Subsampling Bam File With Samtools.
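
A minimal example, assuming a samtools build recent enough to include the feature (file names are placeholders):

    # the integer part of -s seeds the RNG, the fractional part is the fraction
    # of reads to keep; hashing on the read name keeps both mates of a pair
    samtools view -b -s 42.05 input.bam > subsampled.bam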


Picard tools has it; trying it...
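
The tool is DownsampleSam; the invocation is along these lines (the fraction and file names are made up for illustration):

    # keeps or drops whole read pairs by read name, so mates stay together
    java -jar picard.jar DownsampleSam INPUT=input.bam OUTPUT=downsampled.bam PROBABILITY=0.05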

13.4 years ago
Bio_X2Y ★ 4.4k

Perhaps a lot of your reads are PCR artifacts; you could consider removing them with samtools (the rmdup feature). Since your reads are paired-end, it will consider pairs A-B and C-D to be duplicates if A and C map to the same coordinate and B and D also map to the same coordinate. As an added bonus, it retains the pair with the highest quality.

Even if you choose not to use this for your main downsampling, it might be a good first step to reduce the size of your BAM in a rational way.
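
If it helps, the invocation is simple (the input must be coordinate-sorted; paired-end handling is the default, and -s would switch to single-end mode):

    samtools rmdup input.bam deduplicated.bam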

13.4 years ago
Allpowerde ★ 1.3k

It seems that GATK also ships a script for downsampling: randomSampleFromStream.pl
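
I haven't checked that script's exact options, but the idea is easy to sketch as a stream filter: make the keep/drop decision once per read name, so both mates survive together. A rough, hypothetical pipeline (the 10% fraction and file names are made up):

    # keep ~10% of read PAIRS: one random decision per read name means mates
    # stay together; note the %keep hash grows with the number of read names
    samtools view -h input.bam \
      | perl -ne 'BEGIN { srand(42) }
          if (/^@/) { print; next }            # pass header lines through
          my ($name) = split /\t/;
          $keep{$name} //= (rand() < 0.1);     # one decision per read name
          print if $keep{$name};' \
      | samtools view -bS - > downsampled.bam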

13.4 years ago
Casbon ★ 3.3k

I have done this when dealing with similar coverages. The pitfall of downsampling is that you don't want to lose the signal from either allele (this is diploid?) or from any sample. 1000/12 ≈ 83 reads per sample on average; the naive binomial probability of missing one allele of a heterozygote is about 0.5^83 ≈ 1e-25, so nothing to worry about there. However, I would split the file by multiplex, downsample, and rejoin, to ensure an equal number of reads from each sample.
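
A rough sketch of that split/downsample/rejoin idea, assuming the multiplexed samples are tagged as read groups and a samtools whose view supports -r and -s (the fraction is arbitrary):

    # downsample each read group separately, then merge, so every sample
    # keeps roughly the same fraction of its reads
    for rg in $(samtools view -H input.bam | awk -F'\t' '/^@RG/ {
            for (i = 1; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4) }'); do
        samtools view -b -r "$rg" -s 42.05 input.bam > "sub.$rg.bam"
    done
    samtools merge downsampled.bam sub.*.bam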

A problem you might have with this approach is that you might see high levels of noise/allelic imbalance due to polymerase errors. I would therefore prefer to use all the data wherever I can. I have been investigating chemware solutions to these problems; you might want to get in touch to discuss. Too late for these data, though.


Yes, using all the data is a sensible thing to do, although it does not protect against polymerase biases, because programs usually go for the majority feature they see anyway; maybe that's why downsampling (to 250 reads) is the default in GATK. Hence, doing the downsampling yourself might be a better way of handling things (especially since you can downsample each sample separately and then combine, as you suggested). But my question is still: which program should I use for the downsampling? :)

Could you go into more detail about what chemware's solution offers?
