Order of GATK Commands
11.6 years ago

I have a mouse genome that was sequenced using 5 different mate-pair libraries, and each library was run on 3 lanes of an Illumina machine. I first aligned the reads at the lane level, resulting in 15 BAM files. Then I merged all the BAM files (lanes) from the same library into a single BAM file, giving 5 "single library BAM" files in total (one per mate-pair library). I want to use GATK to perform indel realignment, dedup, and base quality score recalibration.

Assuming I have enough computational resources to run the GATK tools even on big BAM files, what is the correct order for performing these steps? I personally think I should:

1) Perform "IndelRealigner" at the library level, i.e. on each "single library BAM" file separately.
2) Perform the "Dedup" step at the library level to remove or mark duplicate reads.
3) Use the "TableRecalibration" tool to perform quality score recalibration at the single-lane (read group ID) level. The GATK manual mentions that although a "single library BAM" file may contain reads from different read groups or lanes, GATK will perform the recalibration at the lane level if a distinct RGID is provided in the BAM file for each lane.
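
To make the read-group point concrete, here is a minimal sketch of the `@RG` header lines such a design implies: 5 libraries x 3 lanes gives 15 read groups, each with a unique `ID`, a library-level `LB` shared by its 3 lanes, and one common `SM`. The sample name `mouse1` and the `libN_laneM` naming scheme are hypothetical placeholders, not anything prescribed by GATK.

```shell
#!/bin/sh
# Print one @RG header line per lane: 5 libraries x 3 lanes = 15 read groups.
# The sample name and the libN_laneM naming scheme are hypothetical.
print_read_groups() {
    sample="mouse1"
    for lib in 1 2 3 4 5; do
        for lane in 1 2 3; do
            # ID is unique per lane, LB is shared by the lanes of one library,
            # SM is shared by everything that comes from the same sample.
            printf '@RG\tID:lib%s_lane%s\tLB:lib%s\tSM:%s\tPL:ILLUMINA\n' \
                "$lib" "$lane" "$lib" "$sample"
        done
    done
}

print_read_groups
```

With headers like these, a merged per-library BAM still lets tools tell lanes apart: duplicate marking can work per library through `LB`, while recalibration can treat each `ID` as its own lane.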

But I have read a few recent papers with exactly the same situation as mine (1 sample -> multiple libraries -> each library run across more than one lane, no barcoding) in which indel realignment was performed at the lane level (on each single-lane file), then the recalibration step was performed on each BAM file separately, and finally the lanes coming from the same library were merged to form the five "single library BAM" files.

I just want to make sure I am doing things the correct way.

Thanks.

bam gatk library • 4.7k views

Re: your point about computational resources for big BAM files: if you happen to have access to GPUs, Parabricks is worth a try for running GATK on GPUs:

    docker run \
      --gpus all \
      --rm \
      --volume "$(pwd)":/workdir \
      --volume "$(pwd)":/outputdir \
      nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 \
      pbrun haplotypecaller \
        --ref /workdir/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
        --in-bam /workdir/fq2bam_output.bam \
        --out-variants /outputdir/variants.vcf
11.6 years ago

I guess that the best practice would be to follow GATK's advice for best practices, wouldn't it?

I personally use the "better" suggestion, since the merging step of the "best" suggestion has always given me problems due to internal sample labeling on SOLiD platforms. We would use it only on small targeted resequencing projects, but we've found that all the steps suggested as "better" lead to fairly believable results.


Yeah, I tend to go with GATK's best practices as well; they are pretty straightforward and seem to work. I would use the better option, but I often only have 1-3 exome samples per project, and I've never been sure whether doing VQSR with samples from different projects (different diseases and families) is a good idea or not.


That's exactly the point I was trying to make. If you have mixed things, it doesn't seem reasonable to lump them together. Sure, if you work constantly with the same kits, reagents, sample types, etc., using the merging step of the best practices would be wise, but that is very rarely the case in our lab... to date ;)

11.6 years ago

I don't know if I do it the "correct way", but here is my approach:

Align and de-dup separately.

Sort and merge together with read groups.

Generate indel target intervals.

Run indel realignment.
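
As a rough illustration of that order, the sketch below prints (rather than runs) one plausible command sequence for a single library. It assumes GATK 3.x-style invocations (`-T RealignerTargetCreator` and `-T IndelRealigner`, which are not part of GATK4), Picard `MarkDuplicates`, and `samtools merge`. All file names, the reference name `mm_ref.fa`, and the helper `plan_library` are hypothetical, and the lane BAMs are assumed to be already sorted and tagged with their own read groups.

```shell
#!/bin/sh
# Dry run: prints the commands for one library instead of executing them.
# Assumes GATK 3.x syntax; every file name below is a hypothetical placeholder.
plan_library() {
    lib="$1"
    ref="mm_ref.fa"

    # 1. De-dup each aligned lane separately (inputs assumed already sorted).
    for lane in 1 2 3; do
        echo "java -jar picard.jar MarkDuplicates I=${lib}_lane${lane}.bam O=${lib}_lane${lane}.dedup.bam M=${lib}_lane${lane}.metrics.txt"
    done

    # 2. Merge the lanes of this library, then index the merged BAM.
    #    Each input is assumed to carry its own @RG header line.
    echo "samtools merge ${lib}.merged.bam ${lib}_lane1.dedup.bam ${lib}_lane2.dedup.bam ${lib}_lane3.dedup.bam"
    echo "samtools index ${lib}.merged.bam"

    # 3. Generate indel target intervals.
    echo "java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ${ref} -I ${lib}.merged.bam -o ${lib}.intervals"

    # 4. Run indel realignment over those intervals.
    echo "java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ${ref} -I ${lib}.merged.bam -targetIntervals ${lib}.intervals -o ${lib}.realigned.bam"
}

plan_library lib1
```

Printing the plan first makes it easy to check the step order and file naming before committing compute time; piping the output to `sh` would execute it, provided the tools and input files actually exist.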
