Downsampling BAM Files
13.4 years ago
Allpowerde ★ 1.3k

Hi, I have insanely high coverage in my BAM files (thanks to only 12-way multiplexing of a tiny genomic region). Since some programs can't handle such extreme coverage, I would like to downsample, that is, randomly pick, say, 1000 aligned reads for each position (e.g. with bamtools random: "Select random alignments from existing BAM file(s)"), but I need to keep the mates intact. Another way of putting this: I want to achieve a maximal read depth of 1000 at each position.
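
For illustration, this is roughly the bamtools call I mean (file names are placeholders); as far as I can tell it draws individual alignments, so mates are not guaranteed to stay together:

    # draw 1000000 alignments at random; mate pairs may end up broken apart
    bamtools random -in input.bam -out random.bam -n 1000000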

Does anyone have experience with downsampling? Are there any pitfalls I might not be aware of?


Well, it is a targeted resequencing experiment with multiple overlapping PCR fragments, so I have spikes of coverage going beyond 100,000x next to areas that have hardly any reads at all. Conceptually, I'm after the functionality GATK offers with -dcov, but I want the result written out as a BAM file.
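
For context, a -dcov call looks roughly like this (the walker, reference, and file names here are placeholders, not my actual pipeline):

    # caps the depth GATK uses at each locus to 1000 reads, but the downsampled
    # reads only exist inside the walker; they are not written back out as a BAM
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -R reference.fa -I input.bam -dcov 1000 -o calls.vcf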


3 answers, none chosen as correct or even upvoted! I'm only answering for the karma ;)


When you say "insane", can you quantify the range please?

12.8 years ago
lh3 ★ 33k

Downsampling support has been added to "samtools view" in SVN. It always keeps both ends of a pair using a very simple strategy (<10 lines of additional source code). Credit to Hideo Imamura at Sanger.


Has anybody tried this? I can't find any documentation of this functionality in samtools view. Thanks!


For archiving's sake: the feature is called 'subsampling' and is invoked with the -s switch (code); see Subsampling Bam File With Samtools.
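
A minimal example, assuming a samtools build recent enough to include the feature (file names are placeholders):

    # the integer part of -s seeds the RNG, the fractional part is the fraction
    # of reads to keep; hashing on the read name keeps both mates of a pair
    samtools view -b -s 42.05 input.bam > subsampled.bam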


Picard tools has it; trying it...
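
The tool is DownsampleSam; the invocation is along these lines (the fraction and file names are made up for illustration):

    # keeps or drops whole read pairs by read name, so mates stay together
    java -jar picard.jar DownsampleSam INPUT=input.bam OUTPUT=downsampled.bam PROBABILITY=0.05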

13.4 years ago
Bio_X2Y ★ 4.4k

Perhaps a lot of your reads are PCR artifacts; you could consider removing them with samtools (the rmdup feature). Since your reads are paired-end, it will consider pairs A-B and C-D to be duplicates if A and C map to the same coordinate and B and D also map to the same coordinate. As an added bonus, it retains the pair with the highest quality.

Even if you choose not to use this for your main downsampling, it might be a good first step to reduce the size of your BAM in a rational way.
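
If it helps, the invocation is simple (the input must be coordinate-sorted; paired-end handling is the default, and -s would switch to single-end mode):

    samtools rmdup input.bam deduplicated.bam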

13.4 years ago
Allpowerde ★ 1.3k

It seems that GATK also ships a script for downsampling: randomSampleFromStream.pl
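
I haven't checked that script's exact options, but the idea is easy to sketch as a stream filter: make the keep/drop decision once per read name, so both mates survive together. A rough, hypothetical pipeline (the 10% fraction and file names are made up):

    # keep ~10% of read PAIRS: one random decision per read name means mates
    # stay together; note the %keep hash grows with the number of read names
    samtools view -h input.bam \
      | perl -ne 'BEGIN { srand(42) }
          if (/^@/) { print; next }            # pass header lines through
          my ($name) = split /\t/;
          $keep{$name} //= (rand() < 0.1);     # one decision per read name
          print if $keep{$name};' \
      | samtools view -bS - > downsampled.bam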

13.4 years ago
Casbon ★ 3.3k

I have done this when dealing with similar coverages. The pitfall of downsampling is that you don't want to lose the signal from either allele (this is diploid?) or from any sample. 1000/12 ≈ 83 reads per sample on average; the naive binomial probability of missing one allele of a heterozygote is about 0.5^83 ≈ 1e-25, so nothing to worry about there. However, I would split the file by multiplex, downsample, and rejoin, to ensure an equal number of reads from each sample.
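
A rough sketch of that split/downsample/rejoin idea, assuming the multiplexed samples are tagged as read groups and a samtools whose view supports -r and -s (the fraction is arbitrary):

    # downsample each read group separately, then merge, so every sample
    # keeps roughly the same fraction of its reads
    for rg in $(samtools view -H input.bam | awk -F'\t' '/^@RG/ {
            for (i = 1; i <= NF; i++) if ($i ~ /^ID:/) print substr($i, 4) }'); do
        samtools view -b -r "$rg" -s 42.05 input.bam > "sub.$rg.bam"
    done
    samtools merge downsampled.bam sub.*.bam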

A problem you might have with this approach is that you might see high levels of noise/allelic imbalance due to polymerase errors. I would therefore prefer to use all the data wherever I can. I have been investigating chemware solutions to these problems; you might want to get in touch to discuss. Too late for these data, though.


Yes, using all the data is a sensible thing to do, although it does not protect against polymerase biases, because programs usually go for the majority feature they see anyway; maybe that's why downsampling (to 250 reads) is the default in GATK. Hence, doing the downsampling yourself might be a better way of handling things (especially since you can downsample each sample separately and then combine, as you suggested). But my question is still: which program should I use for the downsampling? :)

Could you go into more detail about what chemware's solution offers?
