Question

Samtools: merge and mpileup vs mpileup alone for variant-calling with multiple BAM

2

Entering edit mode

7.2 years ago

Rob ▴ 150

Hi, I'm doing some variant-calling, and I was wondering what is the difference between:

merging BAM files with samtools merge and call samtools mpileup on the merged BAM

and

skip the merge step and call mpileup with all the BAM files?

I did both and expected the same results, but I got 15% of identical SNPs, 60% close to identical but lower quality with the samtools merge step, and the others SNPs are not find by one or the other method.

Is it possible that the quality filter applied in mpileup affect each file separately and cause this behaviour or is it something else?

samtools variant-calling merge mpileup SNP • 7.0k views

ADD COMMENT • link updated 6.2 years ago by Biostar 20 • written 7.2 years ago by Rob ▴ 150

2

Entering edit mode

More coverage gives greater confidence in variant calls, and decreases the likelihood that only a single allele will be present for heterozygous variants. I would not expect merging several 100x coverage datasets to change the result much, but merging several 5x coverage datasets will yield an extremely large difference.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

But since the RG tags are identical for one sample, doesn't mpileup merge files itself? And so the coverage should be identical compared to the step with merge no?

ADD REPLY • link 7.2 years ago by Rob ▴ 150

0

Entering edit mode

that is the part that i don't know if mpileup treats 2 sets of RG tags as if you have passed 2 separate files to mpileup, generating 1 variant call file, but with calls for each sample. Best way just to it until getting vcf and see how many individuals you get in the end

ADD REPLY • link 7.2 years ago by stolarek.ir ▴ 700

0

Entering edit mode

The strange things with this is that I have more coverage on a single SNPs without merge the files before. Here one example (same for all SNPs common in both methods) :

                   Ref   Alt     QUAL     GT     DP
Without merge:      T     C    48.0158    0/1    26
With merge   :      T     C    36.8363    0/1    23

ADD REPLY • link 7.2 years ago by Rob ▴ 150

0

Entering edit mode

This VCF has only one sample? Hmmm, looks to me like something strange is going on.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Yep, only one sample (same RG tag for each Lane)

ADD REPLY • link 7.2 years ago by Rob ▴ 150

0

Entering edit mode

What you are describing should not happen. But at least, please make sure you are using the latest versions of all the software (I assume that would be samtools and bcftools); that can often resolve issues. It might also help if you could post the exact command lines you used in each case.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

that reminds me also of my old problem a bit using samtools 1.19 and bcftools 1.2. Either use both of the 1.19 or both at 1.2 versions <- you can also experiment with that, and some crazy bat crap is happening sometimes

ADD REPLY • link 7.2 years ago by stolarek.ir ▴ 700

0

Entering edit mode

More coverage gives greater confidence in variant calls, and decreases the likelihood that only a single allele will be present for heterozygous variants. I would not expect merging several 100x coverage datasets to change the result much, but merging several 5x coverage datasets will yield an extremely large difference.

I would just like to question this: our empirical data suggests that increased depth of coverage does not increase confidence in the calls - it is a common misconception that more is better, in terms of variant calling. Due to the fact that NGS reads contain so much error, increase depth of coverage can, I believe, actually make it more difficult for the variant callers to make an accurate call.

Our data suggests that ideal call range for germline variants is 18-30 read depth.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

score 2 · Answer 1 · 2017-02-17

2

Entering edit mode

7.2 years ago

stolarek.ir ▴ 700

if you just did merge on BAMs, that just adds up coverage and mpileup has no idea, that it's using two files. Depending on how low the coverage was on individual files, the results will vary, though most of the SNP calls should not change <- but again coverage of the individual files

ADD COMMENT • link 7.2 years ago by stolarek.ir ▴ 700

0

Entering edit mode

But what mpileup do with the 2 files informations? Since it's the same sample (with identical @RG tag), does not it merge them together too?

ADD REPLY • link 7.2 years ago by Rob ▴ 150

1

Entering edit mode

Never really did it, but it could treat them as separate samples if when merging you applied RG tags, and then didn't ignore them at mpileup, but you should google that, as I'm not sure

ADD REPLY • link 7.2 years ago by stolarek.ir ▴ 700

0

Entering edit mode

I will try more research, because yeah, that is the information I'm looking for :) So which way do you personally suggest as the best? Merge first sample with same RG tags, and do variant calling on the merged file?

ADD REPLY • link 7.2 years ago by Rob ▴ 150

0

Entering edit mode

anytime you have more data, and all previous steps are exactly the same <- you are not trying to compare anything between those 2 technical replicates, then just merge and call on that one big file like you did

ADD REPLY • link 7.2 years ago by stolarek.ir ▴ 700

0

Entering edit mode

Hi Rob, my advice is to not merge the BAMs and to instead pass them separately to samtools mpileup. I see no reason to merge. Take a look at my workflow here where I actually downsample each sample and then pass a total of 4 BAMs to mpileup for each sample: https://github.com/kevinblighe/ClinicalGradeDNAseq

SAMtools treats each as separate lines of evidence going into each variant call.

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Kevin,

Can you provide some context/detail. The original post (and this is an issue we have as well) is about whether the variant calling behaves differently depending on whether all the samples (with appropriate RG info) are merged into a single BAM prior to running samtools mpileup, or having each individual BAM file inputted together as a list in the call to samtools mpileup. As the post demonstrated this results in a difference. The documentation is a bit thin on this, but in their own workflow, they do seem to suggest it http://www.htslib.org/workflow/#mapping_to_variant

Have you experimentally verified that for your data you are getting the same thing? Based on my reading, I would think you would do better (in particular with samples with lower individual coverage) to have them all merged prior. I look forward to your thoughts.

ADD REPLY • link 6.1 years ago by ian.dworkin • 0

0

Entering edit mode

I have not replicated this exactly but what is happening is the following: When you merge the samples (by RG) into a single BAM and then call variants on this, the downsampling will only look so far into the reads and will then make a variant call. However, when you pass the BAMs separately to samtools mpileup, it will perform the downsampling and variant calling separately on each BAM and then take a look at the consensus calls.

The pitfall to passing multiple BAMs separately is that you increase the likelihood of missing singletons, i.e., variants present in just a single sample.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k