remove a percent of reads from a BAM file
3
2
Entering edit mode
7.7 years ago
sacha ★ 2.4k

I would like to remove x% of reads from a BAM file at a specific region . For instance : remove 50% of reads in chr4:1-1000000

What's the most efficient approach ?

BAM remove • 2.9k views
ADD COMMENT
0
Entering edit mode
7.7 years ago
Zaag ▴ 860

This should work:

 java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -o output.bam \
   -dfrac 0.5 \
   -L chr4:1-1000000

https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php

ADD COMMENT
0
Entering edit mode

I prefer something using samtools / bedtools ! But thanks, I will try it

ADD REPLY
0
Entering edit mode

Maybe something like this:

samtools view -sb 0.5 in.bam chr4:1-1000000 > out.bam
ADD REPLY
0
Entering edit mode

Those commands keeps reads . I want to exclude them. I though to remove all reads in chr4:1-1000000, and then merge with your output. Unfortunally, I didn't find how to remove easily reads in the region chr4:1-1000000

ADD REPLY
1
Entering edit mode

This removes all the reads in the specified region (-XL option) to include use L, don't know a samtools equivalent:

 java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -o output.bam \
   -XL chr4:1-1000000
ADD REPLY
0
Entering edit mode

Thanks ! it seems working. But I wonder why the reference is required.

ADD REPLY
0
Entering edit mode
7.7 years ago
stam • 0

Similarly, you can do this with Picard Tools

java -jar ~/tools/picard.jar DownsampleSam I=Input.bam o=Output.bam PROBABILITY=0.5

Or use use samtools view

samtools view -s 0.5 Input.bam > Output.bam

Picardtools has the advantage that it keeps or discards read pairs as a whole. E.g. if one mate is removed/kept, the other is as well.

ADD COMMENT
0
Entering edit mode

Seems those commands keeps reads. I want to exclude reads from a region chr4:1-1000000.

ADD REPLY
0
Entering edit mode

Randomly drawing 50% of the reads from a region is the same as randomly discarding 50% of the reads, isn't it?

Both samtools and picard allow you to specify the region as well.

ADD REPLY
0
Entering edit mode

Attached a picture of what I want :

https://snag.gy/Hyb4Cu.jpg

ADD REPLY
0
Entering edit mode
7.7 years ago
sacha ★ 2.4k

To remove reads from a bam in a particular region, I only found the python way using pysam. Here is the pseudo-code

for read in bamfile:
     if read.pos > region_start and (read.pos + read.length) < region_end:
          do_nothing()
     else:
        output.write(read)
ADD COMMENT
0
Entering edit mode

My version of pysam doesn't have read.length - what is it? If you meant len(read.seq), strictly speaking you should parse the CIGAR string as that takes into consideration insertions/deletions/skips/softclipped bases/etc. Also, obviously, read pairs will be all screwed up as stam says, but perhaps thats OK for your application.

ADD REPLY
1
Entering edit mode

I m using : read.query_length

ADD REPLY
1
Entering edit mode

Yeah, I mean, you can do that by all means, but it won't be as accurate as parsing the CIGAR strings. To be honest, for your tests it'll probably be fine. But this is what i'm talking about:

>>> b
<pysam.calignedsegment.AlignedSegment object at 0x109a18f58>
>>> print b
SOLEXA1_0001:4:87:13822:5416#0  16  0   3299480 0   8M16308N28M -1  -1  36  GAAACTAACACTCTCTCTCTCTCTCTCTCTCTCTCT    array('B', [34, 34, 34, 34, 34, 34, 34, 34, 34, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34])    [('NM', 2), ('XS', '-'), ('NH', 5), ('CC', 'chr10'), ('CP', 85523284)]
>>> print b.query_length
36

So this is a read where, according to the CIGAR, the first 8bp map, then theres a 16,308bp gap, then 28bp map. To get the end position you want the POS + 8 + 16308 + 28, and not POS + 36. However, since reads like this can be rare, you might want to just add code that checks if the read has a weird CIGAR and ask you manually to accept/reject the read. Might be faster in the long run.

I also don't know how pysam handles soft-clipped reads (and for the record, this is exactly why i dislike pysam - it causes more ambiguity than it gives back in abstraction). Perhaps query_length, whatever that is, takes into account soft-clipped reads. Perhaps not. Who knows. But if it does not and you have soft-clipped adapters, you'll have to parse CIGAR strings for sure.

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6