Question

What Is The Best Pipeline For Human Whole Exome Sequencing?

121

Entering edit mode

13.9 years ago

Biomed 5.0k

We will be getting data from human whole exome sequencing done on Illumina GAIIx. I am planning to put together a pipeline to align (BWA?)(optional as we may get the sequence already aligned)|variant detection (Samtools?)|variant classification-deleteriousness prediction(something like SIFT/PolyPhen)|. I know that there are various options out there but I would like to know based on your experiences what the best collection of these tools is? If you had the chance to start from scratch what kind of a pipeline would you put together? I also know that big sequencing centers have their in house tool sets for these for squeezing the last drop but what is available to the general public is what I am looking for.

Thanks

next-gen sequencing illumina exome human • 72k views

ADD COMMENT • link updated 12.8 years ago by User 9126 ▴ 50 • written 13.9 years ago by Biomed 5.0k

8

Entering edit mode

You seem to have listed all the essential components of an exome sequencing pipeline. Exome sequencing is not yet sufficiently well-established to have a single "best-practice" pipeline available. It's still in the roll-your-own stage. You're going to have to experiment with the options for each component (aligner, SNP-caller, functional annotator, etc) to see which give the best results. You'll probably have to write a lot of glue to make the components fit together.

ADD REPLY • link 13.9 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

Thanks for the answer. I do realize that there is no single best practice pipeline but I was hoping people might chip in with their own combinations of "this works best for me" collections. I think I 'll just have to get the answer myself at this point.

ADD REPLY • link 13.9 years ago by Biomed 5.0k

1

Entering edit mode

This is really wonderful thread.

I am little bit confused in using this pipeline if i have say 10 samples.

Should I first align all files to the genome independently,combine as single bam file and proceed further.? or should I need to process every file independently through all these steps?

Thanks Santhosh

ADD REPLY • link 12.8 years ago by User 9126 ▴ 50

0

Entering edit mode

This is not a discussion forum. Questions like yours usually go as a separate item on this site. Link to this question when asking your question.

ADD REPLY • link 12.8 years ago by Hranjeev ★ 1.5k

0

Entering edit mode

Since I was very unsatisfied with confirming variants produced by CASAVA, I tried to setup this pipeline described abouve by Quigley and L for our HiSeq/SeqCapEZ exome data.

After a long weekend and some hassle with incompatible Java versions and a reference concatenated (hGRC37, but copied together in a non-karyotypic way) in the wrong way, I now make it to the 'samtools calmd...' part.

But all I get back is the error message 'Floating point exception', which I have no idea how to deal with. I feel short of going beserk. Anyone seen this before that can tell me how to deal with it?

Best regards,
Michael Gombert

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 12.6 years ago by Michael Gombert • 0

1

Entering edit mode

Michael, this is not a discussion thread, although it may appear that way, if you have a question to ask please post it as a new question so it can be answered and archived to help people in the future

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 12.6 years ago by User 59 13k

Ram · Answer 1 · 2010-07-23

108

Entering edit mode

13.7 years ago

David Quigley 11k

Here is our basic pipeline. We are novices, but this seems to work for us. Please investigate all of the tools carefully, as (I'll repeat) I am not an expert in this analysis. Note that I am using mostly default values for the analysis; advice I gotten from experienced hands is it's routine to use the intersection of multiple SNP calling methods, and that indel calling is still an art. Note that this is set up for single reads, not paired reads, but the same basic pipeline should apply.

PREPROCESS:

Index human genome (Picard), we used HG18 from UCSC.
Convert Illumina reads to Fastq format
Convert Illumina 1.6 read quality scores to standard Sanger scores

FOR EACH SAMPLE:

Align samples to genome (BWA), generates SAI files.
Convert SAI to SAM (BWA)
Convert SAM to BAM binary format (SAM Tools)
Sort BAM (SAM Tools)
Index BAM (SAM Tools)
Identify target regions for realignment (Genome Analysis Toolkit)
Realign BAM to get better Indel calling (Genome Analysis Toolkit)
Reindex the realigned BAM (SAM Tools)
Call Indels (Genome Analysis Toolkit)
Call SNPs (Genome Analysis Toolkit)
View aligned reads in BAM/BAI (Integrated Genome Viewer)

SAMPLE SCRIPT

/bin/BWA/bwa aln -f /output/FOO.sai \
            -t 3 /seq/REFERENCE/human_18.fasta \
            /seq/FQ/FOO.sanger.fq

/bin/BWA/bwa samse -f /output/FOO.sam \
            /seq/REFERENCE/human_18.fasta \
            /output/FOO.sai \
            /seq/FQ/FOO.sanger.fq

/bin/SAMTOOLS/samtools import /seq/REFERENCE/human_18.fasta \
            /output/FOO.sam \
            /output/FOO.bam

/bin/SAMTOOLS/samtools sort /output/FOO.bam /output/FOO.sorted

/bin/SAMTOOLS/samtools index /output/FOO.sorted.bam \
            /output/FOO.sorted.bam.bai

java -jar /bin/GTK/GenomeAnalysisTK.jar -T RealignerTargetCreator \
            -R /seq/REFERENCE/human_18.fasta \
            -I /output/FOO.sorted.bam \
            -o /output/FOO.intervals

java -jar /bin/GTK/GenomeAnalysisTK.jar -T IndelRealigner 
            -R /seq/REFERENCE/human_18.fasta \
            -I /output/FOO.sorted.bam \
            -targetIntervals /output/FOO.intervals \
            --output /output/FOO.sorted.realigned.bam

/bin/SAMTOOLS/samtools index /output/FOO.sorted.realigned.bam \
            /output/FOO.sorted.realigned.bam.bai
java -jar /bin/GTK/GenomeAnalysisTK.jar -T IndelGenotyperV2 \
            -R /seq/REFERENCE/human_18.fasta \
            -I /output/FOO.sorted.realigned.bam \
            -O /output/FOO_indel.txt \
            --verbose \
            -o /output/FOO_indel_statistics.txt

java -jar /bin/GTK/GenomeAnalysisTK.jar -T UnifiedGenotyper \
            -R /seq/REFERENCE/human_18.fasta \
            -I /output/FOO.sorted.realigned.bam \
            -varout /output/FOO.geli.calls \
            -vf GELI \
            -stand_call_conf 30.0 \
            -stand_emit_conf 10.0 \
            -pl SOLEXA

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 13.7 years ago by David Quigley 11k

2

Entering edit mode

I haven't yet; our region of interest was quite small, and we only used single reads for that experiment. I think the discussion here is a good place to start; the respondents know a lot more than I do about this question.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.0 years ago by David Quigley 11k

1

Entering edit mode

Very interesting and detailed suggestion. Will try this out once our exome data is ready.

ADD REPLY • link 13.7 years ago by Khader Shameer 18k

1

Entering edit mode

Do you do base quality score recalibration? If not, why so? Thanks for sharing!

ADD REPLY • link 12.6 years ago by Adrian Cortes ▴ 550

0

Entering edit mode

great start. thanks this was what I was looking for when I asked this question. I hope we can build on this and share our experiences as we are not experts either but together I believe we can become experts much faster. I am still hung up on the preprocessing step. how did you Convert Illumina reads to Fastq format and Convert Illumina 1.6 read quality scores to standard Sanger scores? Do you have to convert pipeline 1.6 quality scores to sanger scores?

ADD REPLY • link 13.7 years ago by Biomed 5.0k

0

Entering edit mode

To convert Illumina to fastq I think I used MAQ (maq sol2sanger in.sol.fastq out.sanger.fastq). See http://maq.sourceforge.net/fastq.shtml, http://seqanswers.com/forums/showthread.php?t=1801 . It may be that the quality scores are not relevant for BWA, please chime in if you know this. My first alignment used MAQ; I switched to BWA because it can do gapped alignment, and it's faster. seqanswers.com is highly recommended for getting started.

ADD REPLY • link 13.7 years ago by David Quigley 11k

0

Entering edit mode

I have a question about the workflow provided by you. In the SNP calling script by GATK there is a parameter -vf GELI. what is the meaning of this as i was trying to search this one on the Broad website but there was no description. It will be very helpful if you can describe all the parameters.

Thanks

Saurabh

ADD REPLY • link 13.7 years ago by Saurabh Baheti • 0

0

Entering edit mode

The version of the UnifiedGenotyper that I used a few weeks ago had a flag -vf which let you output GELI format for the genotype calls. The default was VCF. Apparently they have removed that flag in the latest version and only output VCF, see http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper. To get help with all flags, type java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper --help.

ADD REPLY • link 13.7 years ago by David Quigley 11k

0

Entering edit mode

Thanks David for the clarification. I have one more question regarding the Indel Calling using GATK. Default parameters for GATK Indel calling are 6X coverage and min of 2 indel-suported read then it will call that variant as indel. But when i compare the samtools indel calls with same parameters, samtools gives double the number of indels than GATK. And on uisng IGV to visualize the exclusive indels by samtools they are not false positive. Just a curious question do you recommend using GATK for indel calls??

ADD REPLY • link 13.7 years ago by Saurabh Baheti • 0

0

Entering edit mode

Thank you David for precise illustration of whole tools :)

ADD REPLY • link 13.4 years ago by Thaman ★ 3.3k

0

Entering edit mode

Hello Saurabh, sorry I missed your comment before. In my limited experience, the GATK and MAQ genotype algorithms give very different results. It's not my specialty, so I can't comment on which is preferable; the only thing we've done is Sanger sequencing to work out a gold standard and adjust backwards from there. Hopefully someone who know more than I can point to published work that clears this up.

ADD REPLY • link 13.4 years ago by David Quigley 11k

0

Entering edit mode

Hi David, thanks for the pipeline example. I notice that although you indexed the reference genome you don't seem to be using an indexed file on any of the parameters. What am I missing? Cheers

ADD REPLY • link 13.2 years ago by Chris Beck • 0

0

Entering edit mode

Hi David: Have you used any tools to remove/mark PCR duplicates ?

ADD REPLY • link 13.0 years ago by Khader Shameer 18k

0

Entering edit mode

Thanks David. That link is very useful.

ADD REPLY • link 13.0 years ago by Khader Shameer 18k

0

Entering edit mode

David: I am able to run the entire steps on my paired end data with minor modifications. I would like to know how did you integrated dbSNP / rsIDs to your .VCF files in the UnifiedGenotyper step.

ADD REPLY • link 12.9 years ago by Khader Shameer 18k

0

Entering edit mode

In theory you can get this automatically using the -D parameter to pass in a ROD file, available in the GATK bundle at http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle . In practice this bundle was literally not present on the broad FTP server when I went looking for it a few weeks ago, so I just pulled down rsIDs from UCSC using the same genome build as my mapping and appended them to my output file with a python script using a hash of location to rsID.

ADD REPLY • link 12.9 years ago by David Quigley 11k

0

Entering edit mode

Thanks David, that's really helpful. Also I noted that ROD file is not required for dbSNP132 with latest version of GATK. I can directly use VCF. http://getsatisfaction.com/gsa/topics/dbsnp132_rod

ADD REPLY • link 12.9 years ago by Khader Shameer 18k

0

Entering edit mode

Thanks for pointing that out; I think that step should be added to this pipeline. See this NG technical report, particularly figure 3: http://www.nature.com/ng/journal/v43/n5/full/ng.806.html

ADD REPLY • link 12.6 years ago by David Quigley 11k

0

Entering edit mode

Great. MAGI, cBioPortal can be added into the pipeline.

ADD REPLY • link 5.2 years ago by Shicheng Guo ★ 9.4k

Ram · Answer 2 · 2010-11-22

26

Entering edit mode

13.4 years ago

lh3 33k

My this answer better serves as a comment to David Quigley's, which is great, but to emphasize the recent improvements, I still give it as separate answer.

Between step 7 and 8, it is recommended to add an additional step:

samtools calmd -Abr FOO.sorted.bam human_18.fasta > FOO.baq.bam

This will substantially improve SNP specificity.

A few other comments to David Quigley's pipeline are:

It is recommended to use Dindel for indel calling. It is the few indel callers that properly model indels (most not, including IndelGenotyperV2). Broad is reimplementing the Dindel model in GATK. The latest samtools also models indels properly, but it is not evaluated as widely as Dindel.
For the human genome build 36, it is recommended to use the version used by the 1000 genomes project. You will lose genes in the pseudoautosomal regions if you use hg18. Hg19/b37 is fine.
The samtools and GATK SNP callers are converging in algorithms and in results. GATK is arguably better than MAQ, which has been confirmed by several independent studies.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.4 years ago by lh3 33k

1

Entering edit mode

BAQ has been added to the GATK pipeline. See more detailed information about how to use it yourself here:

http://www.broadinstitute.org/gsa/wiki/index.php/Per-base_alignment_qualities_(BAQ)_in_the_GATK

ADD REPLY • link updated 11.0 years ago by brentp 24k • written 13.4 years ago by Michael.James.Clark ▴ 560

0

Entering edit mode

Thanks for the update Heng.

ADD REPLY • link 13.4 years ago by Biomed 5.0k

0

Entering edit mode

Hi Hemp,

I am confused as to what set of analysis/commands once needs to do after running calmd? Is the .baq.bam file going to input to what's next? Sorry I am newbie to this analysis and am confused what I am supposed to do next with calmd output.

ADD REPLY • link 12.3 years ago by Angel ▴ 220

0

Entering edit mode

What does samtools calmd do? I do not understand the word "cap" in man-page definition. Thanks.

ADD REPLY • link 11.5 years ago by Biomonika (Noolean) 3.2k

score 8 · Answer 3 · 2010-06-01

Not exome sequencing specific, but you may start with improving your sequence first:

Illumina's pipeline output may differ between versions. Newer version (1.6) should handle better "crowded" lanes
you may try to use an alternative basecaller (Ibis http://bioinf.eva.mpg.de/Ibis/)
this one use with caution: there are few programs for data correction, i.e. Correction tool for SOAPdenovo.

This should improve numbers of mapped reads. I have to admit I did not test the whole pipeline (I got hijacked by something else for time being), and my goal was not SNP but transcript (any transcript) detection.

As for mapping itself, it depends if you also look for novel splicing / exon skipping etc. If yes, then you need a mapper capable of spliced mapping.

An incomplete list of mappers: http://openwetware.org/wiki/Wikiomics:RNA-Seq

score 7 · Answer 4 · 2010-11-22

For simultaneous SNP, MNP, and INDEL calling you can also use FreeBayes, which has the added benefit of optionally using other samples from the same population to increase callset quality.

It generates VCF from BAM files. It benefits greatly from realignment preprocessing steps which David Quigley describes.

score 3 · Answer 5 · 2010-11-23

3

Entering edit mode

13.4 years ago

Laura ★ 1.8k

For functional annotation of your snps/indels once you have them you might want to look at the ensembl variation effect predictor which can be used both on the website for small number of snps or using the api for large numbers

http://www.ensembl.org/tools.html

ADD COMMENT • link 13.4 years ago by Laura ★ 1.8k

score 3 · Answer 6 · 2010-12-15

Thank you for this wonderful thread which is full of useful information from my perspective.

If the platform was SOLiD, what variation of this excellent post and (kindly provided) code from David Quigley would be required?

1) Does use of human_g1k_v37 provide advantage over HG18? (sorry if this has been addressed elsewhere. could not find the specific answer).

2) When using BWA with Colorspace data. There is this thread in Seqanswers mentioning possible problems (longer read lengths while indexing). http://bit.ly/hB4IO8 and there is mention of colorspace support on BWA manual http://bit.ly/dgl1XD

3) Any issues noted with GATK. I have not used GATK, but have overheard conversations between people who have tried about having trouble with atleast some aspects of it.

Thanks

JVJ

score 3 · Answer 7 · 2011-06-20

3

Entering edit mode

12.8 years ago

Roman Valls Guimerà ▴ 620

We've been using Brad Chapman's one for a while now in production ourselves here in Sweden:

https://github.com/chapmanb/bcbb/blob/master/nextgen/README.md

ADD COMMENT • link 12.8 years ago by Roman Valls Guimerà ▴ 620

score 1 · Answer 8 · 2010-05-31

1

Entering edit mode

13.9 years ago

Kevin ▴ 640

You might wanna check out usegalaxy.org for NGS pipeline. There are others as well. offhand I cannot remember them.

ADD COMMENT • link 13.9 years ago by Kevin ▴ 640

6

Entering edit mode

I know galaxy but I can not see how I can use it to 1. Align 2. Variant call 3. Variant deleteriousness predict. It is nice for manipulating common file formats, convert between them and matching your data against common public data sets but this was not my question.

ADD REPLY • link 13.9 years ago by Biomed 5.0k

5

Entering edit mode

Hi @Kevin. Thanks for contributing to BioStar. Your reply is rather shortish and vague. Please consider putting more time to write answers that are clear, detailed, and useful. You will thus be more helpful to the members of this community. Cheers

ADD REPLY • link 13.9 years ago by Eric Normandeau 11k

score 1 · Answer 9 · 2011-02-16

1

Entering edit mode

13.2 years ago

Gard Thomassen ▴ 10

Interesting thread.

Does GATK work well with exome sequencing when it comes to the realigning part? When using it on RNA-seq I run into truble as it was originally designed for complete genome sequencing and it kind of freaks out when it finds that you only have transcript data and no data for the introns... This messes up the sliding window algorithm used in the realignment part. Have you fixed the code to take care of this? A fix was suggested on the Broad-help-site but I have not gotten around to pursue it. Or has the GATK been upgradeded to cope with exome and RNA-seq NGS data?

Best regards Gard

ADD COMMENT • link 13.2 years ago by Gard Thomassen ▴ 10

0

Entering edit mode

Truth be told I don't know, Gard. When I posted our pipeline we had just done a genome sequencing project; we were "about to do" an exome project "very soon" but I haven't got the data yet. At least I did say I was a novice at this kind of analysis. If you find a good answer please post, as my answer is deficient without it.

ADD REPLY • link 13.2 years ago by David Quigley 11k

Ram · Answer 10 · 2015-02-05

Since this 4 year old thread appears to still have a bit of life in it I'll add yet another answer.

The Utah Genome Project maintains a wiki describing the steps and command lines behind it's variant calling pipeline. This is the standard 'GATK Best Practices' pipeline using HaplotypeCaller, but for someone running this for the first time it can be difficult to find a site that provides actual detailed command lines to work through.

http://weatherby.genetics.utah.edu/UGP/wiki/index.php/UGP_Variant_Pipeline

There is also a Perl based implementation of the pipeline: http://srynobio.github.io/cApTUrE/

Finally, gVCFs for several 1KG populations are available for download to enable joint genotyping for improved variant quality.

Barry

score 0 · Answer 11 · 2013-10-07

Thank you for the valuable thread. I have some more query for which I need some suggestions, I am new to GATK and want to use it for my exome sequencing data analysis. I have been a bit lost reading all the blogs , comments and the technical forums. So here is something I want to say and please correct and guide me through the procedure. I have downloaded the hg19 files from the UCSC browser and created the reference genome but do I need to again use the one which is there in GATK repository and then align my samples for downstream analysis? Also I want to run the GATK in my institute cluster. So if am not wrong I should create the directory of the latest GATK version and transfer all the necessary files via Filezilla in the cluster directory with the same name. Now this I have already done. So next thing is to download the bundle from the repository where I see 2 versions , so which one should I download? 2.5 or 2.3? Also once I download the bundle do I have to download anything else? So here it is which I should be downloading right in my cluster. The jar file and the resource folder with the .java files and then in the main directory of the GATK version folder in my cluster I should download the bundle version (2.5 or 2.3) and then unzip all the files that are there in the bundle directory. Right? Please let me know. Then I should be ready to use the GATK for the different downstream processes listed below:

Identify target regions for realignment (Genome Analysis Toolkit) ->Realign BAM to get better Indel calling (Genome Analysis Toolkit) ->Reindex the realigned BAM (SAM Tools) ->Call Indels (Genome Analysis Toolkit) ->Call SNPs (Genome Analysis Toolkit)->View aligned reads in BAM/BAI (Integrated Genome Viewer)

Please let me know if this looks correct or not. The VCF files from the 1kG and the DBSNP are already there in compressed form in the bundle repository of the GATK website which I am currently downloading and I can use them directly after unzipping them.

Ram · Answer 12 · 2014-09-25

0

Entering edit mode

9.6 years ago

Saad Khan ▴ 440

There is a pipeline made by Baylor college of medicine folks implemented at DNA nexus and publicly available at https://www.hgsc.bcm.edu/software/mercury. Its a good pipeline to give annotated vcf from which user could start his/her downstream analysis.

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Saad Khan ▴ 440