Merge BAM files for a lane with multiplexed samples.
0
0
Entering edit mode
8.9 years ago

I have N individuals (say, ind01, ..., ind96), all sequenced on the same lane, say the third one the flowcell, and I followed the htslib workflow:

  1. First, I demultiplexed the raw reads to get one fastq file per individual.
  2. Then, I aligned them on the reference genome with bwa using option -R '@RG\tID:3\tSM:ind01' for the first individual, and so on. (I then used samtools fixmate, and samtools sort.)

Thus, I have one bam file per individual (N in total for the lane).

I now would like to do local indel realignment as well as base quality score recalibration with GATK. As the tools work per lane, I would like to merge all bam files obtained at the end of step 2 above. But I don't know how to handle the @RG tags.

According to the SAM specification, each @RG line must have a unique ID. As BaseRecalibrator is read-group-aware, the ID should be "3". But I will then loose the information of which reads come from which individual (the sample tag, SM). I read the documentation on "samtools merge", but it doesn't look like it can handle my case.

Any idea?

Of course, once this 3rd step is done, I will want to split again the bam file into one bam file per individual. Moreover, as the same individual may be sequenced on several lanes, I will want to merge the bam files per individual across lanes before doing the SNP and genotype calling.

multiplexing samtools baserecalibrator gatk • 4.4k views
ADD COMMENT
0
Entering edit mode

The per-lane stuff in the "GATK best practices" should be renamed "per-sample per-lane" steps. So there's no reason to merge different individuals from the same lane.

Do you have "ID:3" for all of your samples? That would be unusual.

ADD REPLY
0
Entering edit mode

Should I conclude from your message that one cannot use GATK BaseRecalibrator with multiplexed data?

ADD REPLY
0
Entering edit mode

No, that would in no way be a fair interpretation of what I wrote.

ADD REPLY
0
Entering edit mode

OK, sorry, I misunderstood your point.

I understand that I shouldn't use the same ID for different individuals. I can easily fix that. But then I don't know how to use BaseRecalibrator on all individuals of the same lane, since all ID will be different.

My problem comes from the fact that GATK documentation says: "The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group".

What you advise is to use a single ID per individual in the lane. But what if a given individual has less than 100M bases? Moreover, I don't expect the "individual" effect to be very strong compare to the "lane" effect, "machine cycle" effect, "dinucleotide" effect, etc. That's why I initially used "ID=3" to remove confounders globally for the whole lane, but this obliges me to loose the information on which read comes from which individuals.

Hope I'm clearer.

ADD REPLY
0
Entering edit mode

GATK tools are read group aware, so you can run the BaseRecalibrator on a merge lane-level BAM file just fine.

ADD REPLY
0
Entering edit mode

Thanks, it seems indeed that the only possibility is to fit the model in BaseRecalibrator per individual, even if all individuals are merged into a single bam file with one read group per individual.

ADD REPLY
0
Entering edit mode

"ID" is actually supposed to be a unique identifier for the lane-library-flowcell combination. "LB" is for library, and "SM" is for sample. So, the IDs should be different between each of your 96 individuals.

ADD REPLY
0
Entering edit mode

In my case (individuals multiplexed on the same lane), would you put the same value LB = SM = ind01, etc, for each individual?

ADD REPLY
0
Entering edit mode

SM = LB = ind01, SM = LB = ind02, SM=LB=ind03, etc. This is assuming that there is only one library per sample.

ADD REPLY

Login before adding your answer.

Traffic: 1837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6