Question

Skipping The Local Realignment Around The Indels And Rather Aligning The Mapped Reads On The Exonic Region.

0

Entering edit mode

10.4 years ago

ivivek_ngs ★ 5.2k

I have a query that we do the local realignment around the indels with the known sites because it identifies most parsimonious alignment along all reads at a problematic locus. InDels in reads among the near ends can cause the mappers into mis-aligning with mismatches. These artifactual mismatches can harm base quality recalibration and variant detection.

However if one is not interested in performing this step as one might not be interested in the InDel assessment downstream but if one wants a local realignment of the mapped reads around the exonic regions using the bed files containing the probes that was provided by the company to check the quality of reads that mapped only to the exonic regions. Can this be done? I am particularly interested in trying to see how many of my mapped reads which aligned to the whole hg19 genome is actually getting mapped on the exonic region, then we can assess the coverage of the reads in that way.

I believe we can do this also by using the DepthofCoverage walker and use it on the local indel realigned file along with the target interval bed file which captures the probe for the exonic regions to find out the statistics of the mapping. But if we skip the local realignment around the indels and rather perform alignment around the exonic region with the bed files before variant calling can we do that? Is there any walker in the GATK 2.3 which can perform this task? Or are there any other tool which can be used to perform this step just aligning the mapped reads to the exonic region and then do a quality check i mean the read counts that pertain to only the exonic region as this will give me more robust variant calling downstream and I will not have to be worried at the end of losing out a lot of reads that did not get excluded owing to regions of introns in the reference genome. If anyone has perform such kind of analysis please advice and what tool has been used to do this? be it the SAMTOOLs or the GATK.

Since am using the GATK , I would prefer some walker in the GATK that can perform that task, please suggest.

Thanks

exome-sequencing gatk • 6.7k views

ADD COMMENT • link updated 9.1 years ago by Charles Warden 8.2k • written 10.4 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

If you don't want local realignment then don't do it. Many variant callers (most?) don't do realignment at all (samtools for example), plus that process only makes a difference for indels anyhow. If you want to map against exome, then use that as your target database, his makes a difference only if you are using a prepackaged pipeline software that strictly requires certain naming/files to be present.

ADD REPLY • link 10.4 years ago by Istvan Albert 100k

0

Entering edit mode

But the target ref gene for hg19 is the whole genome. Is it possible to retrieve only exonic regions of the hg19 somehow and then align the reads with bwa/bowtie to that but sadly am using GATK which will restrict my downstream analysis as it will not work for that kind of target exomes, it only works for the genome wide. It would be highly appreciable if you have performed such analysis or not. let me know. Many thanks for the reply

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

that is a different question, I would post that separately, there are many ways to extract the sequences for genes. Retrieve the sequences from Ensemble, use a masked the genome, extract sequences with a command line tool. It all depends on what you are familiar with.

ADD REPLY • link 10.4 years ago by Istvan Albert 100k

score 2 · Answer 1 · 2015-04-01

I've personally found that the indel realignment can important when using the GATK UnifiedGenotyper, but it wasn't as important for other variant callers (including the GATK HaplotypeCaller) and it wasn't as important as the quality score recalibration (which, again, was mostly important for the UnifiedGenotyper):

https://peerj.com/articles/600/

score 1 · Answer 2 · 2013-11-27

1

Entering edit mode

10.4 years ago

Sean Davis 26k

I think you might be describing a case of "premature optimization". I would suggest following a relatively standard workflow of alignment to the genome, post-processing, and quality control. For post-processing, you'll need to decide what steps are important to you and I would put indel realignment into a category of "maybe" with duplicate marking in the "must" category for most workflows. As for quality control, you might look at the picard tools to see if one meets your needs:

http://picard.sourceforge.net/picard-metric-definitions.shtml

There are, of course, many other softwares to choose from.

As for aligning to exons only, I think that would likely be highly counterproductive in terms of sequence yield AND potential discovery, but I have not seen a direct comparison between that and genomic alignment.

ADD COMMENT • link 10.4 years ago by Sean Davis 26k

0

Entering edit mode

Yes I understand the risk that might have happened, I was a bit curious to just have the exonic sequence and then align the reads to it and do a quality control and process downstream to yield true SNP, the most astonishing thing which I encountered is the exome bait bed files which comes with the target enrichment kit actually. The V4 sure select one, I checked it on the genome browser and much to my surprise I see few gene like NPM1 the exome bead from the bed file is actually not covering most part of the exome, so I might lose mutations that might happen on the region of the gene that is not under the bead. So either I have to align all the sequence to a exome construct or I can prepare a bed file which will be having beads that cover most part of the exome. The doubt in me struck when I was doing a quality control on the local realigned bam file. I was doing a quality control with the bam file , the reference genome hg19 from gatk bundle and the agilent bed file to see how many reads actually mapped on the exomes. And to my surprise it was one-thrid of the reads that actually align on the exonic region. So my curiosity is rising. Does anyone confirm me that doing a quality control with the walker DepthofCoverage/Diagnose Targets with GATK , how much reads one gets that got mapped on the exomes while using the exome baits provided with the kit, then I can have some confidence that might be one-third mapping is not wrong after all.

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

Having a good proportion (I'm consciously avoiding giving a number here) of your reads be off-target is not unusual. The enrichment is not perfect.

ADD REPLY • link 10.4 years ago by Sean Davis 26k

0

Entering edit mode

This means that the target enrichment step was not perfect before sequencing so the mapping of my reads are one-third on the exomes of the total mapped reads on the reference genome. But then I cant do more out of it and have to find out the mutations out of this one-third mapped reads over exome. I did another quality control with the exome baits that contain both the current enrichment kit added with the exonic region baits as well so that the bed file baits cover most of the exome and the QC metric shows that my read mapping now is around 50% of the total reads that got mapped on the reference genome. I guess if I use this same bed file for the SNP calling am sure I will get some more mutations which I might have missed out earlier. Any suggestions with the proceedings? Thank you for all the advices.

ADD REPLY • link 10.4 years ago by ivivek_ngs ★ 5.2k

0

Entering edit mode

You don't really need to use a BED file for SNP calling, but you may if you like. If you choose to use one, then make sure that doing so is in line with available literature on the subject as well as your experimental questions.

ADD REPLY • link 10.4 years ago by Sean Davis 26k