I have N individuals (say, ind01, ..., ind96), all sequenced on the same lane, say the third one the flowcell, and I followed the htslib workflow:
- First, I demultiplexed the raw reads to get one fastq file per individual.
- Then, I aligned them on the reference genome with bwa using option -R '@RG\tID:3\tSM:ind01' for the first individual, and so on. (I then used samtools fixmate, and samtools sort.)
Thus, I have one bam file per individual (N in total for the lane).
I now would like to do local indel realignment as well as base quality score recalibration with GATK. As the tools work per lane, I would like to merge all bam files obtained at the end of step 2 above. But I don't know how to handle the @RG tags.
According to the SAM specification, each @RG line must have a unique ID. As BaseRecalibrator is read-group-aware, the ID should be "3". But I will then loose the information of which reads come from which individual (the sample tag, SM). I read the documentation on "samtools merge", but it doesn't look like it can handle my case.
Any idea?
Of course, once this 3rd step is done, I will want to split again the bam file into one bam file per individual. Moreover, as the same individual may be sequenced on several lanes, I will want to merge the bam files per individual across lanes before doing the SNP and genotype calling.
The per-lane stuff in the "GATK best practices" should be renamed "per-sample per-lane" steps. So there's no reason to merge different individuals from the same lane.
Do you have "ID:3" for all of your samples? That would be unusual.
Should I conclude from your message that one cannot use GATK BaseRecalibrator with multiplexed data?
No, that would in no way be a fair interpretation of what I wrote.
OK, sorry, I misunderstood your point.
I understand that I shouldn't use the same ID for different individuals. I can easily fix that. But then I don't know how to use BaseRecalibrator on all individuals of the same lane, since all ID will be different.
My problem comes from the fact that GATK documentation says: "The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group".
What you advise is to use a single ID per individual in the lane. But what if a given individual has less than 100M bases? Moreover, I don't expect the "individual" effect to be very strong compare to the "lane" effect, "machine cycle" effect, "dinucleotide" effect, etc. That's why I initially used "ID=3" to remove confounders globally for the whole lane, but this obliges me to loose the information on which read comes from which individuals.
Hope I'm clearer.
GATK tools are read group aware, so you can run the BaseRecalibrator on a merge lane-level BAM file just fine.
Thanks, it seems indeed that the only possibility is to fit the model in BaseRecalibrator per individual, even if all individuals are merged into a single bam file with one read group per individual.
"ID" is actually supposed to be a unique identifier for the lane-library-flowcell combination. "LB" is for library, and "SM" is for sample. So, the IDs should be different between each of your 96 individuals.
In my case (individuals multiplexed on the same lane), would you put the same value LB = SM = ind01, etc, for each individual?
SM = LB = ind01, SM = LB = ind02, SM=LB=ind03, etc. This is assuming that there is only one library per sample.