Question

Detection of deletion for mitochondrial NGS analysis at position 3107

0

Entering edit mode

6.4 years ago

NB ▴ 960

Hello,

Has anyone worked on mtDNA NGS analysis ? I'm developing an inhouse pipeline using the rCRS reference. The pipeline is as follows

1) BWA map to reference; Picard/Sambamba sort, index mark duplicates

2) GATK base recalibration

3) Samtools mpileup followed by Varscan for variant calling (strand filter 0 min-var-freq 0.01 --min-avg-qual 20 --min-coverage 50 )

We know that rCRS reference has an artefact at position 3107 and should be detected when aligned and also should be homzygous with high heteroplasmy levels. But using the above steps, 3107 does not get detected in all samples and if it does, it classifies as heterozygous with low levels of heteroplasmy (less than 15%)

Has anyone encountered such issues or can suggest ways to improve the pipeline ?

I know this artefact is usually excluded from the reference but in this particular case, we do not exclude it but classify it as an artefact.

thank you

mitochondria NGS alignment mtDNA • 2.7k views

ADD COMMENT • link updated 6.4 years ago by Gabriel R. ★ 2.9k • written 6.4 years ago by NB ▴ 960

0

Entering edit mode

Hey Nandini, I used mPileup + varscan as you suggested. I still have some questions/problems want to ask you:

Did you use varscan mpileup2cns command? I used this and I found this only call one mutation frequency for me each position even there are more than 1 variant at that site. Check here for detail description.
So I tried GATK HaplotypeCaller but still have some problems. I wonder have you think about treat mtDNA as tumor and use varscan Tumor-normal Comparison set rCRS as normal tissue data, or use Mutect tool from GATK. I am think about this because mtDNA and tumor both has heterogeneity. However tumor has more other features different from mtDNA like more commonly structure variants. Thanks.

ADD REPLY • link 5.6 years ago by MatthewP ★ 1.4k

0

Entering edit mode

Hi MatthewP, So I use varscan mpileup2SNP and mpileup2indel and then combine the two result files, followed by annotation. If you are using GATK, then I THINK you can use GATK's Mutect rather than Haplotypecaller (though I do not have much experience with GATK and mtDNA analysis)

ADD REPLY • link 5.6 years ago by NB ▴ 960

0

Entering edit mode

So can you get 1 frequency value per variant instead of per site(POS?) I really need frequency for each variant(Clients need mtDNA heterogeneity information). I already find HaplotypeCaller not suite for mtDNA data(because amplicon-based and haploid).

ADD REPLY • link 5.6 years ago by MatthewP ★ 1.4k

score 0 · Answer 1 · 2017-11-22

0

Entering edit mode

6.4 years ago

Jeremy Leipzig 22k

I suppose by "detected" you mean some base in the read aligned to 3107, which is a 'N' (due to historical legacy reasons)

AGTAATCCAGGTCGGTTTCTATCTACNTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCT

Can you show us a pileup where something aligns to it? If this is troubling I suppose you could look for an aligner which is more sensitive like Novoalign or Stampy, in the hope they'll be more likely to generate an indel there.

ADD COMMENT • link 6.4 years ago by Jeremy Leipzig 22k

0

Entering edit mode

So I tried to remove GATK base recalibration and I can see that this has been picked up but still the heteroplasmy level is less than 10% for some reason

This is the mpileup snapshot for some samples Sample1:

MT  3107    N   12  TTTTTTTTTt$t^SC <<<<<<<<<<D<

Sample2:

MT  3107    N   149 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTt$t$t$t$t$t$t$tt$tTTTttt$TTttCt^SC^SC^HC^SC^SC^SC^SC^SC^SC^SC^SC^SC^SC^SC^OC^SC^SC^SC^SC^OC^SC^OC^SC^SC^SC^Sc^Sc^Sc^Sc^Sc^Oc^Sc^Sc^Sc  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<8<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<5<<<<<<<<<<<<<<<<<<<<<.<.B<<G.EF.<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Sample3:

MT  3107    N   232 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtt$t$t$t$ttt$t$tttt$tt$t$c$TTTTTtt$Gt$t$TTTaT^SC^SC^SC^SC^SC^SC^OC^SC^SC^SC^SC^SC^SC^SC^SC^SC^MC^SC^SC^SC^SC^SC^SC^SC^SC^SC^SC^SC^SC^OC^SC^RC^SC^MC^SC^SC^SC^SC^SC^SC^SC^SC^OC^SC^SC^SC^SC^NC^SC^SC^SC^SC^SC^SC^SC^Sc^Hc^Sc^Sc^Oc^Sc^Mc^Sc^Sc^Sc^Sc^Oc^Sc^Sc^Sc^Sc^Mc^Sc^Sc^Sc^Sc^Sc^Sc   <<<<<<5<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<:<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<@<<<<<<BG<<<<D<6<<<G/<E<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

ADD REPLY • link 6.4 years ago by NB ▴ 960

0

Entering edit mode

so in the instances where a 'T' is assigned is there a gap further downstream?

ADD REPLY • link 6.4 years ago by Jeremy Leipzig 22k

1

Entering edit mode

@jeremy no, there isnt'. So I tried a few things. We were initial mapping the mt sequences to whole genome GrCh37. Now what I did, is just map it to MT rather than the whole genome. Also, I removed the step of GATK recalibration. Now I just map it to bwa , followed by marking duplicates and variant calling. For some reason, that seems to work . Although it calls the variant at 3105 as a deletion AC -> C, but I think thats fine. But I'm still not being able to reason it out why would using a specific reference make so much difference, while the variant was present in bam files either ways.

ADD REPLY • link 6.4 years ago by NB ▴ 960

score 0 · Answer 2 · 2017-11-22

0

Entering edit mode

6.4 years ago

Gabriel R. ★ 2.9k

A few comments, I found that SHRiMP: compbio.cs.toronto.edu/shrimp/ is much better to handle short indels. I have my own little wrapper if you want to check out my pipeline: https://github.com/grenaud/schmutzi/blob/master/wrapper.pl

It also uses a module to handle circularity to get the edges called correctly.

GATK recalibration? Like how will this work? Also are you mapping to the mt in isolation?

ADD COMMENT • link 6.4 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

Hi Gabriel, I wasn't sure if it was appropriate or not to use GATK recalibration. So these samples are analysed specific only for mtDNA for diagnostic environment. Does SHRiMP perform better than novoalign ? I know novoalign was quite good but we had to change to BWA as the license had to be renewed

ADD REPLY • link 6.4 years ago by NB ▴ 960

0

Entering edit mode

I know how GATK does recal for nuclear data, I do not know how or even if this is possible on mt data. I do not know about novoalign but SHRiMP is free and open-source and I normally use BWA for nuclear and SHRiMP for mt. SHRiMP has a more intricate algorithm.

another question, how do you call mpileup for the haploid sample?

ADD REPLY • link 6.4 years ago by Gabriel R. ★ 2.9k

0

Entering edit mode

We use Varscan2 that takes in mpileup as its input and generates a vcf file. What do you use for variant calling ?

ADD REPLY • link 6.4 years ago by NB ▴ 960

0

Entering edit mode

Well if you use mpileup, you should inform it that you use a haploid model. We tend to use endoCaller to call a consensus: https://github.com/grenaud/schmutzi/blob/master/endoCaller.cpp from https://grenaud.github.io/schmutzi/ it is designed for damaged/contaminated ancient DNA fragments but can work for modern DNA using default parameters.

ADD REPLY • link 6.4 years ago by Gabriel R. ★ 2.9k