Normally, we run NGS analysis in the following steps:
1.
~/bwa-0.7.16a/bwa mem -t 32 -M -R '@RG\tID:1\tPL:Illumina\tSM:HCC1143' ~/reference/human_g1k_v37_decoy.fasta 1.fastq 2.fastq | samblaster --splitterFile my.split.sam --discordantFile my.discord.sam > my.sam
2.
~/sambamba_v0.6.6 view -S -f bam -l 0 my.sam | ~/sambamba_v0.6.6 sort -t 32 -m 60G --tmpdir /data/ -o my.sort.bam /dev/stdin
~/sambamba_v0.6.6 index my.sort.bam
3.
~/sambamba_v0.6.6 markdup --tmpdir /data/ --overflow-list-size 800000 --remove-duplicates my.sort.bam my.nodup.bam
~/sambamba_v0.6.6 index my.nodup.bam
4.
java -Xmx60G -jar ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar --analysis_type BaseRecalibrator -nct 32 --out my.recal.grp --disable_indel_quals --reference_sequence ~/reference/human_g1k_v37_decoy.fasta --input_file my.nodup.bam --knownSites ~/reference/dbsnp_137.b37.vcf --knownSites ~/reference/1000G_phase1.indels.b37.vcf --knownSites ~/reference/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -L 20
java -Xmx60G -jar ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar --analysis_type PrintReads -nct 32 --out my.base_recalibrated.bam --reference_sequence ~/reference/human_g1k_v37_decoy.fasta --input_file my.nodup.bam --BQSR my.recal.grp
5.
java -Xmx60G -jar -Djava.io.tmpdir=/data/ ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/reference/human_g1k_v37_decoy.fasta -I my.base_recalibrated.bam --genotyping_mode DISCOVERY -o my_variants.vcf
The five steps above take 60+ hours on a 32-core/128 GB machine for the dataset NA12878-Garvan-Vial1_R1.fastq.gz and NA12878-Garvan-Vial1_R2.fastq.gz. I built a parallel pipeline to speed this up, and it did reduce the run time to 15 hours. Can anybody tell me whether this pipeline is OK or not?
1. pigz -d *.fastq.gz
2. Split the two FASTQ files into 8 parts each, say NA12878-Garvan-Vial1_R1_0.fastq through NA12878-Garvan-Vial1_R1_7.fastq and NA12878-Garvan-Vial1_R2_0.fastq through NA12878-Garvan-Vial1_R2_7.fastq, where each part has almost the same number of reads (the same number of text lines).
3. Run bwa mem | samblaster on 8 virtual machines simultaneously.
4. Use sambamba to produce 8 sorted BAMs on 8 virtual machines simultaneously.
5. Use "bamtools split -reference" to split the 8 sorted BAMs, then use "samtools merge" to produce several BAM files, one per chromosome: REF_1.bam, REF_2.bam, ..., REF_22.bam, REF_X.bam, REF_Y.bam, REF_MT.bam, plus many others such as REF_GL000249.1.bam.
6. Start 8 virtual machines, each running markdup on some chromosomes' BAMs.
7. Start 8 virtual machines, each running BQSR on some chromosomes' BAMs.
8. Start 8 virtual machines, each running HaplotypeCaller on some chromosomes' BAMs.
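The record-aligned FASTQ split in step 2 can be sketched as below. The demo first generates a tiny paired FASTQ so the logic is self-contained; the file names and data here are illustrative, not the real NA12878 inputs. The key points are that the chunk size must be a multiple of 4 lines so no FASTQ record is torn apart, and that R1 and R2 must be split with the same chunk size so mates stay paired.

```shell
# Demo input: a tiny paired FASTQ (16 read pairs). In the real pipeline
# these would be the full NA12878-Garvan-Vial1_R1/R2 files.
: > demo_R1.fastq
: > demo_R2.fastq
for i in $(seq 1 16); do
  printf '@read%d/1\nACGT\n+\nIIII\n' "$i" >> demo_R1.fastq
  printf '@read%d/2\nTGCA\n+\nIIII\n' "$i" >> demo_R2.fastq
done

NPARTS=8
lines=$(wc -l < demo_R1.fastq)          # R1 and R2 have identical line counts
reads=$((lines / 4))
# Round reads-per-chunk up, then convert back to lines: always a multiple
# of 4, so no FASTQ record is split across two chunks.
per_chunk=$(( ( (reads + NPARTS - 1) / NPARTS ) * 4 ))

# GNU split: numeric suffixes _0 .. _7, keeping the .fastq extension
split -l "$per_chunk" -d -a 1 --additional-suffix=.fastq demo_R1.fastq demo_R1_
split -l "$per_chunk" -d -a 1 --additional-suffix=.fastq demo_R2.fastq demo_R2_
```

Because both files are split with the same `-l` value, chunk i of R1 always pairs with chunk i of R2.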
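The per-chromosome split/merge in step 5 could be driven from a generated command list like the sketch below, which only writes the commands to a file (so it runs without bamtools/samtools installed) for later dispatch to the VMs. The per-VM BAM names part0.sort.bam .. part7.sort.bam are assumptions for illustration.

```shell
# Build the command list for step 5. The part<i>.sort.bam names are
# placeholders; adapt them to the actual per-VM sorted BAM names.
: > split_merge_cmds.txt

# bamtools split -reference writes part<i>.sort.REF_<chrom>.bam per input
for i in $(seq 0 7); do
  echo "bamtools split -in part${i}.sort.bam -reference" >> split_merge_cmds.txt
done

# One merge per chromosome; the glob is expanded by the shell that runs it
for chrom in $(seq 1 22) X Y MT; do
  echo "samtools merge REF_${chrom}.bam part*.sort.REF_${chrom}.bam" >> split_merge_cmds.txt
done
```

The decoy/unplaced contigs (REF_GL000249.1.bam etc.) would need the same treatment; they are omitted here for brevity.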
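Steps 6-8 each map naturally onto one command per chromosome. A sketch for step 8 (HaplotypeCaller), again as a generated command list: the REF_<chrom>.recal.bam input names and the -Xmx8G heap (8 concurrent jobs sharing 128 GB) are assumptions, and -L restricts each job to its own chromosome.

```shell
# Build one HaplotypeCaller command per chromosome (step 8).
# REF_<chrom>.recal.bam is a placeholder for the post-BQSR per-chromosome BAM.
GATK=~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar
REFSEQ=~/reference/human_g1k_v37_decoy.fasta
: > hc_cmds.txt
for chrom in $(seq 1 22) X Y MT; do
  echo "java -Xmx8G -jar $GATK -T HaplotypeCaller -R $REFSEQ -I REF_${chrom}.recal.bam -L ${chrom} -o REF_${chrom}.vcf" >> hc_cmds.txt
done
# Divide hc_cmds.txt across the 8 VMs, or run locally with e.g.:
#   xargs -P 8 -I{} sh -c '{}' < hc_cmds.txt
```

The same pattern works for markdup (step 6) and BQSR (step 7) by swapping in the corresponding sambamba/GATK command.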
Also, you have to judge for yourself whether to do base recalibration or not (I do it for now, but I see only a negligible effect); there was a discussion about whether you really need it.
I did split the FASTQ files in step 2 of the parallel pipeline.
I then ran 8 "bwa mem" jobs on 8 virtual machines in parallel.
"Split BAM file and merge BAM file" refers to generating each chromosome's BAM file.