SNP calling pipeline - testing different programs
5
3
Entering edit mode
9.0 years ago

I am evaluating different possible pipelines to perform the following:

  • Map Ion Proton reads on a reference genome
  • Call SNPs
  • Filter the SNPs

Here are the programs I am testing:

Mapping:

Must be able to deal with the high occurence of indel errors in Ion Proton reads.

  • bwa mem
  • bowtie2

SNP calling

  • samtools
  • freebayes
  • GATK
  • Platypus

Filtering

  • vcftools
  • pyVCF
  • custom

For any of these steps (mapping, calling, filtering), do any of you have experience with the programs listed above? What are their strengths or weaknesses?

Are there other programs I should consider?

For now, I am not considering to use TMAP and TVC, even tough they have been developed specifically for Ion Proton data since nobody seems to be using them.

SNP mapping calling filtering • 5.2k views
ADD COMMENT
4
Entering edit mode
9.0 years ago

let me share some thoughts with you regarding variant calling on Ion data
we have been using TVC for a couple of years as the only source of variants for our Ion PGM and Ion Proton machines since we weren't able to find any specific variant caller for Ion that would give us better results. we tried to use GATK in the past because we do so with SOLiD data running it in parallel with LifeScope (we end up with 2 variant sets that we can merge and let the researcher decide whether to use just one or both of the sets), but we the number of false positives were too high at that time. in fact the number of GATK's variants were almost 10x the number of TVC's.

but we wanted to review this process, so
we are currently running very basic tests in order to compare those TVC results, considering them as our gold standard, with the 2 variant callers that we know best: GATK and samtools+bcftools. we used 8 samples which we have Sanger sequenced and we played around with several configurations (full preprocessing the bam files as the GATK best practices suggests and not doing so before calling, using both HaplotypeCaller and UnifiedGenotyper algorithms on GATK, hard filtering and not doing so) and, although our test are yet raw, I can tell you that the best results we have are the ones coming from the complete GATK pipeline that we run on SOLiD data: full bam files preprocessing before running GATK's HaplotypeCaller and hard filtering the resulting variants.

but we don't feel quite confident using not suggested software, because
the GATK's developers have always stated that their software was not designed to be used with Ion data. in fact, Geraldine has stated a couple of weeks ago that "I'm afraid this has been tabled definitively; we will not produce best practices for dealing with this data type.". but the fact is that we have a 83%-87% overlap between HC's passed variants and TVC's (other configurations reached overlaps never above 75%) and, although we haven't yet compared all this data with Sanger's results to confirm true and false positive rates, it seems that the current GATK's best practices generate a fairly robust variant set, which is even better that samtools+bcftools (this combination, without even preprocessing the bam files, surprisingly outperforms all the rest but the full GATK's best practices reaching overlaps of 80% and being much faster - ~30min vs. ~6h). in fact this pipeline is capable of detecting some single-base indels that we knew and looked for them manually, indels that even samtools+bcftools nor even TVC (that should be optimized to distinguish technology indels from real indels) are not able to detect.

so we are still torturing these variants
because we are not sure whether to stick with TVC variants only or whether to merge them with a completely unsupported yet promising variant set. also, if the computing resources were critical we could be tempted to go for the samtools+bcftools set, but since we'd like to reduce the false positives rate to its minimum we may have to invest more time on the variant calling process. it seems a lose-lose situation, but we really want to make sure that we are generating the best possible results.

ADD COMMENT
0
Entering edit mode

Thank you Jorge. This is very useful input. Would you mind sharing the TMAP/TVC and GATK commands and parameters you used, as a starting point so that others may potentially save lots of time?

ADD REPLY
1
Entering edit mode
  • We use the default TMAP settings on both Ion PGM and Proton, since we've been told by Ion that they were already optimized for target resequencing, which is what we've been doing for a while.
  • Also, we use the default TVC settings, but we lower the discover threshold (we use the lowest available for germline) since we want to lower the false negative rate to its minimum.
  • Finally, we use the settings for the GATK described in its best practices webpage.

No big deal, but at least all the decisions made can be easily referenced and reproducible.

ADD REPLY
1
Entering edit mode
9.0 years ago

I suggest BBMap as an aligner, particularly as you are using Ion Proton reads, as it is very accurate at calling mapping to and calling indels correctly.

ADD COMMENT
0
Entering edit mode

Thank you for the suggestion. BBmap looks pretty cool, but has it seen much use up to now?

Just to clarify, I DO NOT want to call the indels, as they are created by the sequencer. However, I do want to be able to align the reads containing the indels properly and then call only the SNP variants.

ADD REPLY
1
Entering edit mode

BBMap appears to be widely used, though not nearly as much as Bowtie2 and BWA, which have been available longer. It's hard to even guess the exact numbers.

As for indel calling, it's important for the aligner map and call indels accurately even if they are caused by the platform rather than mutations and you don't intend to use them after variant-calling. If an indel is not correctly called as an indel in the cigar string, it will instead be either soft-clipped, called as a series of substitutions, or force the read to be mapped in the wrong place with multiple substitutions instead of an indel. All of those incur bias.

ADD REPLY
1
Entering edit mode
9.0 years ago
Len Trigg ★ 1.6k

You might like to try the RTG suite, which has components for all stages of your pipeline, and interoperates with standard formats if you want to mix and match with other components (although sticking with RTG for the whole pipeline is more streamlined). RTG Core is commercial but free for non-commercial use, which I assume is fine since you have GATK in the list. The RTG variant calling tools have a lot of nice features you'd expect (calibration, haplotype calling of complex variants) and plenty you wouldn't (calling with pedigrees, de-novo and somatic variant detection, sex chromosome handling, etc), although if you are using the Proton I'm guessing these are smaller organisms where some of this is less important. What type of genomes are you going to be processing?

ADD COMMENT
0
Entering edit mode
9.0 years ago

Another option that was pointed to me is STAR. (EDIT: STAR is for RNA-seq, not alignment on a genome, my bad)

https://github.com/alexdobin/STAR

It seems quite new and I do not know if a lot of people have used it yet.

Any one can comment?

ADD COMMENT
0
Entering edit mode

STAR is a RNAseq aligner. I have used it extensively for RNA-seq alignment but the primary reason behind it is that it is pretty fast compared to the other RNA-seq aligners. I think I would focus on more genomic aligners such as BWA-mem and Bowtie2. I have never used BBMap but it seems to be highly sensitivity and tolerant of errors (according to Brian). So I would definitely give it a shot especially for Iontorrent data that comes with lots of homopolymer errors. Have you tried TMAP, the inbuilt aligner for Iontorrent and evaluated how it performed?

ADD REPLY
0
Entering edit mode

Ah, I overlooked the fact that STAR was dedicated to RNA-seq alignment. Thanks for pointing this out.

I left TMAP out of the equation since it seems hardly anyone uses it. I'd love hearing from somebody who tried it though!

ADD REPLY
0
Entering edit mode
8.1 years ago

You might want to filter and explore your VCF file for different kinds of analyses (diversity, haplotypes, population structure).

You can then make use of the online SNP pipeline: http://sniplay.southgreen.fr/cgi-bin/analysis_v3.cgi

ADD COMMENT

Login before adding your answer.

Traffic: 1987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6