I have a mouse genome that was sequenced using 5 different mate-pair libraries, and each library was run on 3 lanes of an Illumina machine. I first aligned the reads at the lane level, resulting in 15 BAM files. Then I merged all the BAM files (lanes) from the same library into a single BAM file, giving 5 "single library BAM" files in total (one per mate-pair library). I want to use GATK to perform indel realignment, duplicate marking (dedup), and base quality score recalibration.
Assuming I have enough computational resources to run the GATK tools even on big BAM files, what is the correct order for these steps? I think I should:
1) Run "IndelRealigner" at the library level, i.e. on each "single library BAM" file separately.
2) Run the "Dedup" step at the library level to remove or mark redundant reads.
3) Use the "TableRecalibration" tool to perform quality score recalibration at the single-lane (read group ID) level.
The GATK manual mentions that although a "single library BAM" file may contain reads from different read groups or lanes, GATK will perform the recalibration at the lane level if an RGID is provided in the BAM file for each lane.
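To make the library-level versus lane-level distinction concrete, here is a minimal sketch of how the read group headers could look for this design. The tag names (ID, SM, LB, PL, PU) come from the SAM format; the sample name "mouse1" and the "libN.laneM" naming scheme are made up for illustration. The idea is that dedup can work per library (LB) while recalibration works per read group (ID), even inside a merged BAM.

```python
from collections import defaultdict

SAMPLE = "mouse1"      # hypothetical sample name, not from the original post
PLATFORM = "ILLUMINA"  # platform tag value for Illumina data


def read_group_line(library: int, lane: int) -> str:
    """Build one SAM @RG header line for a single lane of a single library."""
    rg_id = f"lib{library}.lane{lane}"  # hypothetical naming scheme
    fields = [
        "@RG",
        f"ID:{rg_id}",        # one read group per lane
        f"SM:{SAMPLE}",       # same sample for every read group
        f"LB:lib{library}",   # library tag shared by the lanes of a library
        f"PL:{PLATFORM}",
        f"PU:{rg_id}",
    ]
    return "\t".join(fields)


def build_read_groups(n_libraries: int = 5, lanes_per_library: int = 3) -> list:
    """Return @RG lines for every lane of every library (15 in this design)."""
    return [
        read_group_line(lib, lane)
        for lib in range(1, n_libraries + 1)
        for lane in range(1, lanes_per_library + 1)
    ]


def group_by_library(rg_lines: list) -> dict:
    """Map each library (LB) to the read group IDs (lanes) it contains."""
    groups = defaultdict(list)
    for line in rg_lines:
        # Skip the leading "@RG" token and parse the TAG:value fields.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        groups[tags["LB"]].append(tags["ID"])
    return dict(groups)


if __name__ == "__main__":
    lines = build_read_groups()
    for library, read_groups in group_by_library(lines).items():
        # Library = unit for dedup; each read group = unit for recalibration.
        print(library, "->", ", ".join(read_groups))
```

With headers like these, each of the 5 merged files still carries 3 distinct read groups, so a tool that groups by read group can tell the lanes apart after merging.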
But I have read a few recent papers describing exactly the same situation as mine (1 sample -> multiple libraries -> each library run across more than one lane, no barcoding) in which IndelRealignment was performed at the lane level (on each single-lane file), then the recalibration step was performed on each BAM file separately, and finally the lanes from the same library were merged to form five "single library BAM" files.
I just want to make sure I am doing things the correct way.
Thanks.
Re your point on computational resources for big BAM files: if you happen to have access to GPUs, Parabricks is worth a try for running GATK on GPU: