Normally, we run NGS analysis in the following steps:
1.
~/bwa-0.7.16a/bwa mem -t 32 -M -R '@RG\tID:1\tPL:Illumina\tSM:HCC1143' ~/reference/human_g1k_v37_decoy.fasta 1.fastq 2.fastq | samblaster --splitterFile my.split.sam --discordantFile my.discord.sam > my.sam
2.
~/sambamba_v0.6.6 view -S -f bam -l 0 my.sam | ~/sambamba_v0.6.6 sort -t 32 -m 60G --tmpdir /data/ -o my.sort.bam /dev/stdin
~/sambamba_v0.6.6 index my.sort.bam
3.
~/sambamba_v0.6.6 markdup --tmpdir /data/ --overflow-list-size 800000 --remove-duplicates my.sort.bam my.nodup.bam
~/sambamba_v0.6.6 index my.nodup.bam
4.
java -Xmx60G -jar ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar --analysis_type BaseRecalibrator -nct 32 --out my.recal.grp --disable_indel_quals --reference_sequence ~/reference/human_g1k_v37_decoy.fasta --input_file my.nodup.bam --knownSites ~/reference/dbsnp_137.b37.vcf --knownSites ~/reference/1000G_phase1.indels.b37.vcf --knownSites ~/reference/Mills_and_1000G_gold_standard.indels.b37.sites.vcf -L 20
java -Xmx60G -jar ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar --analysis_type PrintReads -nct 32 --out my.base_recalibrated.bam --reference_sequence ~/reference/human_g1k_v37_decoy.fasta --input_file my.nodup.bam --BQSR my.recal.grp
5.
java -Xmx60G -jar -Djava.io.tmpdir=/data/ ~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/reference/human_g1k_v37_decoy.fasta -I my.base_recalibrated.bam --genotyping_mode DISCOVERY -o my_variants.vcf
The five steps above take 60+ hours on a 32-core/128 GB machine for the dataset NA12878-Garvan-Vial1_R1.fastq.gz and NA12878-Garvan-Vial1_R2.fastq.gz. I built a parallel pipeline to speed this up, and it did reduce the run time to 15 hours. Can anybody tell me whether this pipeline is OK or not?
1. pigz -d *.fastq.gz
2. Split the two FASTQ files into 8 parts each, say NA12878-Garvan-Vial1_R1_0.fastq through NA12878-Garvan-Vial1_R1_7.fastq and NA12878-Garvan-Vial1_R2_0.fastq through NA12878-Garvan-Vial1_R2_7.fastq, where each part has almost the same number of reads (the same number of text lines).
3. Run bwa mem | samblaster on 8 virtual machines simultaneously.
4. Use sambamba to produce 8 sorted BAMs on 8 virtual machines simultaneously.
5. Use "bamtools split -reference" to split the 8 sorted BAMs, then use "samtools merge" to produce several BAM files, one per chromosome: REF_1.bam, REF_2.bam, ..., REF_22.bam, REF_X.bam, REF_Y.bam, REF_MT.bam, plus many others such as REF_GL000249.1.bam.
6. Start 8 virtual machines, each running markdup on some chromosomes' BAMs.
7. Start 8 virtual machines, each running BQSR on some chromosomes' BAMs.
8. Start 8 virtual machines, each running HaplotypeCaller on some chromosomes' BAMs.
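The record-aligned FASTQ split in step 2 can be sketched as below. The demo first generates a tiny paired FASTQ so the logic is self-contained; the file names and data here are illustrative, not the real NA12878 inputs. The key points are that the chunk size must be a multiple of 4 lines so no FASTQ record is torn apart, and that R1 and R2 must be split with the same chunk size so mates stay paired.

```shell
# Demo input: a tiny paired FASTQ (16 read pairs). In the real pipeline
# these would be the full NA12878-Garvan-Vial1_R1/R2 files.
: > demo_R1.fastq
: > demo_R2.fastq
for i in $(seq 1 16); do
  printf '@read%d/1\nACGT\n+\nIIII\n' "$i" >> demo_R1.fastq
  printf '@read%d/2\nTGCA\n+\nIIII\n' "$i" >> demo_R2.fastq
done

NPARTS=8
lines=$(wc -l < demo_R1.fastq)          # R1 and R2 have identical line counts
reads=$((lines / 4))
# Round reads-per-chunk up, then convert back to lines: always a multiple
# of 4, so no FASTQ record is split across two chunks.
per_chunk=$(( ( (reads + NPARTS - 1) / NPARTS ) * 4 ))

# GNU split: numeric suffixes _0 .. _7, keeping the .fastq extension
split -l "$per_chunk" -d -a 1 --additional-suffix=.fastq demo_R1.fastq demo_R1_
split -l "$per_chunk" -d -a 1 --additional-suffix=.fastq demo_R2.fastq demo_R2_
```

Because both files are split with the same `-l` value, chunk i of R1 always pairs with chunk i of R2.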
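The per-chromosome split/merge in step 5 could be driven from a generated command list like the sketch below, which only writes the commands to a file (so it runs without bamtools/samtools installed) for later dispatch to the VMs. The per-VM BAM names part0.sort.bam .. part7.sort.bam are assumptions for illustration.

```shell
# Build the command list for step 5. The part<i>.sort.bam names are
# placeholders; adapt them to the actual per-VM sorted BAM names.
: > split_merge_cmds.txt

# bamtools split -reference writes part<i>.sort.REF_<chrom>.bam per input
for i in $(seq 0 7); do
  echo "bamtools split -in part${i}.sort.bam -reference" >> split_merge_cmds.txt
done

# One merge per chromosome; the glob is expanded by the shell that runs it
for chrom in $(seq 1 22) X Y MT; do
  echo "samtools merge REF_${chrom}.bam part*.sort.REF_${chrom}.bam" >> split_merge_cmds.txt
done
```

The decoy/unplaced contigs (REF_GL000249.1.bam etc.) would need the same treatment; they are omitted here for brevity.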
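Steps 6-8 each map naturally onto one command per chromosome. A sketch for step 8 (HaplotypeCaller), again as a generated command list: the REF_<chrom>.recal.bam input names and the -Xmx8G heap (8 concurrent jobs sharing 128 GB) are assumptions, and -L restricts each job to its own chromosome.

```shell
# Build one HaplotypeCaller command per chromosome (step 8).
# REF_<chrom>.recal.bam is a placeholder for the post-BQSR per-chromosome BAM.
GATK=~/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar
REFSEQ=~/reference/human_g1k_v37_decoy.fasta
: > hc_cmds.txt
for chrom in $(seq 1 22) X Y MT; do
  echo "java -Xmx8G -jar $GATK -T HaplotypeCaller -R $REFSEQ -I REF_${chrom}.recal.bam -L ${chrom} -o REF_${chrom}.vcf" >> hc_cmds.txt
done
# Divide hc_cmds.txt across the 8 VMs, or run locally with e.g.:
#   xargs -P 8 -I{} sh -c '{}' < hc_cmds.txt
```

The same pattern works for markdup (step 6) and BQSR (step 7) by swapping in the corresponding sambamba/GATK command.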
Also, you have to judge for yourself whether to do base recalibration or not (I do it for now, but I see only a negligible effect); there was a discussion about whether you really need it.
I did split the FASTQ files in step 2 of the parallel pipeline.
I then ran 8 "bwa mem" jobs on 8 virtual machines in parallel.
"Split BAM file and merge BAM file" refers to generating each chromosome's BAM file.