I have a bam file, say x.bam, with it I split into forward and reverse halves:
samtools view -F 16 x.bam -b -o x.for.bam #(command a)
samtools view -f 16 x.bam -b -o x.rev.bam #(command b)
I then check depth at certain positions, for example:
samtools depth -b my.bed x.bam | grep 12345678 #(command 1)
samtools depth -b my.bed x.for.bam | grep 12345678 #(command 2)
samtools depth -b my.bed x.rev.bam | grep 12345678 #(command 3)
For some/most positions, the number from command 2 and command 3 add up to the number of command 1; however in some positions, I got results like this:
- command 1 gets 62
- command 2 gets 2
- command 3 gets 8010
apparently 8010+2 != 62, and how come reverse only bam has more depth (8010 reads) at that position than the overall bam file (62 reads)? This does not make sense to me. Did I miss anything here?
TIA
ps:
I also checked map qual for those 8010 reads from reverse bam:
samtools depth -b my.bed -q 30 x.rev.bam | grep 12345678
, which yields 7876
reads, indicating most of the 8010 reads have good map quality.
ps2:
I opened IGV and looked the region. Funny thing is that x.bam
file does not have reads in that position, while x.rev.bam
has thousands of reads at that position. Have no idea why x.rev.bam
gets those reads, while x.bam
does not have it.
You can query on depth directly like so:
once you find a region that seems wrong investigate more with samtools view over that range. The two counts should add up.
I vaguely recall a situation where samtools count and samtools depth gave radically different answers over a region on the account of including (or not) the unmapped reads.
Thank you Istvan Albert
depth with region command you gave generated the same discrepancy I saw.
I opened IGV and looked the region. Funny thing is that x.bam file does not have reads in that position, while x.rev.bam has thousands of reads at that position. Have no idea why x.rev.bam gets those reads, while x.bam does not have it.
Well you can evaluate it directly on the original file
this will return the number of sequences that have coordinates in a region but will include the unmapped reads (Some aligners may fill in the chromosomal coordinate yet still consider the read unmapped!).
Edit: These latter two should match with depth: