WGS data for pipeline development
3
1
Entering edit mode
6.2 years ago
win ▴ 970

Hi all, i am looking for paired end Illumina FASTA files from whole genome NGS run. I am developing some pipelines and was using a BAM file from 1000 genomes. After I run the GATK Best Practices pipeline I am generating a VCF with no data at all. Basically, i am converting BAM to FASTQ, then aligning to hg38 and then other steps as recommended. I am wondering what is so incorrect that no variants are detected?

Before I post questions about the pipeline itself, I wanted to be sure to use an input file which has been recommended by someone in this group.

Any help here is much appreciated and if there was an accompanying VCF that would be awesome.

##Align with BWA-mem.
#sudo ./algorithms/bwa/bwa mem -M references/hg38gatkbundle/Homo_sapiens_assembly38.fasta data/HG100/HG100.1.exome.fastq data/HG100/HG100.2.exome.fastq > data/HG100/HG100.aligned.reads.sam

##Sort the SAM file and it will also convert to BAM
#sudo java -jar algorithms/picard/picard.jar SortSam INPUT=data/HG100.aligned.reads.sam OUTPUT=data/HG100.sorted.reads.bam SORT_ORDER=coordinate

##Reheader the BAM. This is required for GATK.
#sudo java -Xmx8g -jar algorithms/picard/picard.jar AddOrReplaceReadGroups I=data/HG100/HG100.sorted.reads.bam O=data/HG100/HG100.sorted.reheader.reads.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20

##Build BAM index
#sudo java -Xmx8g -jar algorithms/picard/picard.jar BuildBamIndex INPUT=data/HG100/HG100.sorted.reheader.reads.bam

##Perform local realignment
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T RealignerTargetCreator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -known references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -L references/hg38gatkbundle/wgs_calling_regions.hg38.interval_list -I data/HG100/HG100.sorted.reheader.reads.bam -o data/HG100/HG100.realignment.targets.list

##Perform local realignment of reads around indels
#sudo java -jar -Xmx8g algorithms/gatk3/gatk3.8.jar -T IndelRealigner -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG100.sorted.reheader.reads.bam -known references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -known references/hg38gatkbundle/dbsnp_138.hg38.vcf -known references/hg38gatkbundle/Homo_sapiens_assembly38.known_indels.vcf -targetIntervals data/HG100.realignment.targets.list -o data/HG100/HG100.local.realigned.bam

##Perform Base Recalibration
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T BaseRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG15/HG15.local.realigned.bam -knownSites references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG15/HG15.recal.data.table

##Run PrintReads
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T PrintReads -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG15/HG15.local.realigned.bam -o data/HG15/HG15.output.bam --read_filter MappingQualityZero

##Run Unified Genotyper
#Use the following command if you also need to generate a gVCF file.
#sudo java -Xmx8g -jar algorithms/gatk3/gatk3.8.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --output_mode EMIT_ALL_SITES -I data/HG100.output.bam --dbsnp references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG100.output.raw.snps.indels.g.vcf


 ##Use the following command to generate a VCF file
    #sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --output_mode EMIT_VARIANTS_ONLY -I data/HG15/HG15.output.bam --dbsnp references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG15/HG15.output.raw.snps.indels.vcf

#### May have to use this one
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -A BaseQualityRankSumTest -A Coverage -A QualByDepth -A FisherStrand -A ReadPosRankSumTest --output_mode EMIT_VARIANTS_ONLY -I data/HG15/HG15.output.bam --dbsnp references/hg38gatkbundle/Homo_sapiens_assembly38.dbsnp138.vcf -o data/HG15/HG15.output.raw.combined.vcf
###

## Run Variant Recalibrator for SNP
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T VariantRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG15/HG15.output.raw.snps.indels.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 references/hg38gatkbundle/hapmap_3.3.hg38.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 references/hg38gatkbundle/1000G_omni2.5.hg38.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 references/hg38gatkbundle/1000G_phase1.snps.high_confidence.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/hg38gatkbundle/dbsnp_138.hg38.vcf -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile data/HG15/HG15.recalibrate.SNP.recal -tranchesFile data/HG15/HG15.recalibrate.SNP.tranches -rscriptFile data/HG15/HG15.recalibrate.SNP.plots.R

## Apply Variant Recalibration for SNPs 
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T ApplyRecalibration -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG15/HG15.output.raw.snps.indels.vcf -mode SNP --ts_filter_level 99.0 -recalFile data/HG15/HG15.recalibrate.SNP.recal -tranchesFile data/HG15/HG15.recalibrate.SNP.tranches -o data/HG15/HG15.recalibrated.SNP.raw.indels.vcf

## Run Variant Recalibrator for INDEL
## ?? Need to check the QD problem here, for now we have eliminated the QD annotation flag.
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T VariantRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG100.recalibrated.SNP.raw.indels.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/hg38gatkbundle/dbsnp_138.hg38.vcf -an DP -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile data/HG100.recalibrate_INDEL.recal -tranchesFile data/HG100.recalibrate_INDEL.tranches -rscriptFile data/HG100.recalibrate_INDEL_plots.R

## Apply Variant Recalibrator
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T ApplyRecalibration -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG100.recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile data/HG100.recalibrate_INDEL.recal -tranchesFile data/HG100.recalibrate_INDEL.tranches -o data/HG100.recalibrated_variants.vcf
NGS WGS • 2.9k views
ADD COMMENT
5
Entering edit mode

Not what you are looking for, but you shouldn't use sudo to perform tasks like this. Sudo should only be used when you absolutely have to, for "sysadmin tasks".

ADD REPLY
2
Entering edit mode

It is highly recommended to use HaplotypeCaller instead of UnifiedGenotyper

more

ADD REPLY
0
Entering edit mode

I am not quite sure what you were running there, but could it be that the parameters of the pipeline are for use with multiple samples, not just one?

ADD REPLY
0
Entering edit mode

Please be more specific what you need. Tumor-only or matched normal or normal only? There are a couple of standard datasets that include validated mutations. Better start with these, e.g. NA12878. Still, please provide some details on your pipeline. Best would be to provide the exact commands you used and the alignment statistics. No results at all is suspicious. Every WGS sample, also healthy ones, contains thousands of germline mutations. There must be a bug somewhere in your pipeline.

ADD REPLY
0
Entering edit mode

OK! So I am looking for germline mutation so no need for tumor vs. sample. I am posting the entire pipeline just now.

ADD REPLY
0
Entering edit mode

Not really answering your question but maybe useful tips... To add Read Group to your bam files, run bwa mem with the -R STR options. This way you can eliminate the AddOrReplaceReadGroup step. Also, consider piping the output of bwa mem to samtools sort so you get the sorted bam file without passing by SAM. Even better, you could pipe to bamsort in biobambam so you can also index and mark duplicates on the fly (which probably you need to do). (I can't comment on the rest as I'm not familiar with germline variant calling).

ADD REPLY
0
Entering edit mode

Thanks, will implement some of the steps you have outlined.

ADD REPLY
1
Entering edit mode
6.2 years ago

While I am not diagnosing your problem directly, I have three general points that are very actionable.

  1. The approach to debugging a pipeline is to proceed through each step, one at a time, verifying that at each step you produce something like what you expect. This is especially important when the pipeline does not produce the expected results at the end.
  2. A successful pipeline will also include MANY quality control steps. These can help discover problems when they occur. The picard library has several useful tools for quality control (see the "Metrics" functions).
  3. As you learn about how your pipeline can fail, attempt to implement a test or a metric that can help you quickly diagnose your failure the next time it occurs. This can save you days of effort down the road.
ADD COMMENT
0
Entering edit mode

I put this comment here to be more relevant to the answer. Beside what was said by Sean Davis it is also nice if you follow one of the pipelines mentioned here in this review A review of bioinformatic pipeline frameworks, that will be more convenient.

ADD REPLY

Login before adding your answer.

Traffic: 1623 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6