Question

Tetraploid Snp Calling & Snp Filtering

7

Entering edit mode

10.2 years ago

pavenhuizen ▴ 90

Dear all,

I'm currently working on a project involving SNP calling of resistance genes (R-genes) in 96 potato cultivars. We are interested in identifying SNPs (and INDELs) in the NB domain. The NB domains were enriched using PCR and Illumina paired-end read libraries were created for each cultivar. After quality checking (adapter trimming and read quality trimming) the reads were mapped against the potato DM v4.03 genome using NextGenMap. SNP calling is (to be) performed on the known NB-region coordinates.

And now is it time for my question, how should we do the SNP calling? Potato is a tetraploid organism, so theoretically using samtools mpileup should not cover all SNPs/alleles, because (correct me if I'm wrong) samtools is designed for diploid organisms. After a Google search I found three SNP callers (QualitySNPng, freebayes and UNEAK) which should be able to call SNPs in polyploid organisms. My question to the community is if anyone has experience in polyploid (tetraploid) SNP calling and if there is a recommended SNP caller (or if they all behave similarly), or that we maybe should only call SNPs which are called by multiple SNP callers.

Furthermore, we are uncertain about which parameters to use in SNP calling and filtering. The major problem we face is that we are uncertain when we can actually call a SNP; what allele fraction should we use? Or should we call a SNP if it has at least x (high quality) reads supporting it? And what quality score (QUAL as given in the VCF output format, or any other quality measure) is sufficiently high enough to call a SNP with high confidence?

So far we haven't been able to find any satisfying answers to these questions and are therefore uncertain how to proceed. Thanks in advance for anyone taking the time to read this and to anyone who is willing to help us with our problems.

snp calling samtools quality • 8.9k views

ADD COMMENT • link updated 10.2 years ago by Adrian Pelin ★ 2.6k • written 10.2 years ago by pavenhuizen ▴ 90

0

Entering edit mode

Hi, I was wondering if you found answers to the following your questions?

what allele fraction should we use? Or should we call a SNP if it has at least x (high quality) reads supporting it? And what quality score (QUAL as given in the VCF output format, or any other quality measure) is sufficiently high enough to call a SNP with high confidence?

Would be very useful, as I am also working with a tetraplpid species-. Many thanks

ADD REPLY • link 6.2 years ago by User000 ▴ 690

0

Entering edit mode

Unfortunately I cannot provide you with the answers. I have not been part of this project for almost 4 years now and have not continued in any fields related to tetraploid SNP calling.

ADD REPLY • link 6.2 years ago by pavenhuizen ▴ 90

score 3 · Answer 1 · 2014-02-26

3

Entering edit mode

10.2 years ago

Adrian Pelin ★ 2.6k

My advise is to remove adapters but do not do quality trimming. My next advice is to use bwa, as it is a very well known tool and easy to use.

As for your main question, I myself have experience in using FreeBayes on a tetraploid like organism.

Here is a post discussing quality of SNP calls in FreeBayes: Filters/Quality for vcf generated by FreeBayes (Erik Garrison the author of FreeBayes provided some input in that topic)

ADD COMMENT • link 10.2 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Thanks for your fast reply and for the link to that thread, I think this will at least give me some pointers! Since you mention you have experience with tetraploid SNP calling, can you maybe tell me some more how you called the SNPs. Did you look at the allele fractions or allele count, like I asked in my original question, or did you look at the DP and QUAL like advised by Erik Garrison in the thread you mentioned, or any other factors?

ADD REPLY • link 10.2 years ago by pavenhuizen ▴ 90

1

Entering edit mode

freebayes generates a VCF file with all calls. VCF has a tag called AF, which tells you the estimated allele frequency based on the amount of ploidy you specified. So if you specify ploidy to be 4, it will decide if the allele is either at 1.0, 0.75, 0.5 or 0.25 with respect to the reference. AF=1.0 means it's a homozygous SNP.

My workflow starts with bwa alignment and ends with freebayes calling. You need picard tools for this:

bwa index REF.fasta
bwa aln -t 8 -n 0.1 REF.fasta /shared/reads/SAMP-18/RAW/121221_SND104_A_L005_GDR-18_R1.NoAdaptPR.fastq > GDR18_1.sai
bwa aln -t 8 -n 0.1 REF.fasta /shared/reads/SAMP-18/RAW/121221_SND104_A_L005_GDR-18_R2.NoAdaptPR.fastq > GDR18_2.sai
bwa sampe REF.fasta GDR18_1.sai GDR18_2.sai /shared/reads/SAMP-18/RAW/121221_SND104_A_L005_GDR-18_R1.NoAdaptPR.fastq /shared/reads/SAMP-18/RAW/121221_SND104_A_L005_GDR-18_R2.NoAdaptPR.fastq > GDR-18.sam
samtools view -Sb GDR-18.sam > GDR-18.bam
samtools sort GDR-18.bam GDR-18.sort
java -jar ~/programs/picard-tools-1.94/AddOrReplaceReadGroups.jar I=GDR-18.sort.bam O=GDR-18.sort.grp.bam LB=whatever PL=illumina PU=whatever SM=whatever VALIDATION_STRINGENCY=LENIENT
java -jar ~/programs/picard-tools-1.94/MarkDuplicates.jar INPUT=GDR-18.sort.grp.bam OUTPUT=GDR-18.sort.grp.md.bam METRICS_FILE=GDR18.duplicates REMOVE_DUPLICATES=TRUE ASSUME_SORTED=TRUE VALIDATION_STRINGENCY=LENIENT
freebayes --fasta-reference REF.fasta --ploidy 4 --pooled-continuous --min-coverage 5 -F 0.1 --vcf GDR-18.vcf GDR-18.sort.grp.md.bam

If any step doesn't make any sense let me know. With freebayes, I specify -F 0.1 because some of my samples have a very low nucleotide coverage (18x), and that's why min coverage is 5, normally it would be 10.

Adrian

ADD REPLY • link 10.1 years ago by Adrian Pelin ★ 2.6k