Hi all,
I have been using Picard's DownsampleSam
to extract random alignments from a .bam file.
The tool documentation for DownsampleSam
states that
Reads marked as not primary alignments are all discarded
However, when running DownsampleSam
(Picard release 2.4.1) on my .bam file, these 'not primary alignments' are not discarded.
For example, running samtools flagstat
on my .bam file indicates I have a total of 66427440 reads, of which 22101618 are secondary alignments.
Therefore, DownsampleSam
should only take 44325822 reads forward for downsampling, but it is downsampling from all 66427440 reads ("Kept 39866332 out of 66427440 reads (60.01%)").
Below is the command line for the above:
DownsampleSam I=input.bam O=downsampled.bam RANDOM_SEED=2 P=0.6
I then ran the above command again, but using Picard release 1.130. This time, DownsampleSam
did indeed discard the secondary alignments as expected (it is now downsampling from 44325822 reads: "Finished! Kept 35458630 out of 44325822 reads"). The reason why I am not using this release of Picard is because it does not support the STRATEGY
option, which I will need for downsampling to smaller fractions.
I was wondering if anyone has come across this kind of behaviour before? I find it strange that a more recent version is not performing as it should. Or am I missing something obvious? I would be very grateful for any suggestions on why DownsampleSam
is behaving differently between different versions. Thanks in advance!
The documentation states:
[source: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_sam_DownsampleSam.php]
This implies that secondary reads may be kept under certain circumstances.
Thank you for your reply and highlighting the above information from the document.
Following on from this, the help information (from
picard DownsampleSam -h
) for Picard release 1.130 states:This does not go into as much detail as the newer release help information. It looks like picard release 1.130 doesn't treat the secondary reads in the same way as the newer release; as it discards all secondary reads, whereas the newer release either keeps or discards all as a unit. This is a naive question, but am I correct in thinking that I should accept that all secondary reads are kept in my particular .bam file when using the newer release?
Thanks for your help!
It's quite possible that the different versions treat these reads differently. The most up to date documentation contradicts itself, as you have noticed.
On the GitHub page, the 'updated' documentation states:
[source: http://broadinstitute.github.io/picard/command-line-overview.html#Overview]
So, it's difficult to know what exactly it's doing as the documentation itself is contradictory. You can possibly follow the code trail to see what exactly it's doing: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/sam/DownsampleSam.java
Thanks for getting back, I appreciate your time.
I completely agree with you that the documentation are not clear in stating exactly how
DownsampleSam
is handling secondary reads.I had a look at the code trail as you kindly suggested. This may be due to my lack of understanding, but it looks like the code does not specifically indicate how the secondary reads are handled? The code at line 231 is:
But I think the above simply outputs how many reads have been returned (the downsampled reads) out of the number of reads seen (the reads taken forward for downsampling)? I can't find where the code defines exactly how the
getSeenCount()
is obtained. Could you please point me in the right direction in this case?I also had a look at the code trail for the specific picard release I have been using (2.4.1), which should discard all reads marked as not primary alignments (similar to the 'updated' documentation). However, again I can't find this specifically in the code trail.
From what I can see, it seems that picard release 2.4.1 should discard all secondary alignments, but for my .bam file it does not.
I hope you can answer my queries above.
I also looked through the code myself but could not find a definitive part where it states at which SAM flags Picard is looking for DownsampleSam when extracting reads. This may be a question better directed at the developers, in that case.
Absolutely, neither the documentation or the code are very clear in this regard.
Thanks for your time Kevin, your inputs have been very helpful!!
No problem. For the record, DownsampleSam is what I use for step 4 of the pipeline that I mention here: A: Best tool for variant calling
I had never considered the issues that you have found with this function, though.
That's great Kevin, thanks for the link above! It's good to know that you also use
DownsampleSam
for downsampling, as opposed to the other tools I had considered.From what I can tell, this issue I am having hasn't been mentioned before by anyone else. Out of curiosity, I downsampled a .bam file from a different study and again, the secondary reads are not being discarded. It would be good to confirm if anyone else has found this issue too, but as you have suggested, it appears this is an issue in the function itself...