5.7 years ago
serpalma.v
Hello
I have 60 samples (samp1...samp60), each one was barcoded and then pooled (10 samples/pool, 6 pools).
Each pool was sequenced in 9 lanes.
This leads to 1080 fastq files (60 samples * 9 lanes * 2 for paired-end) and 540 bam files (60 samples * 9 lanes).
I want to do variant calling with GATK.
I went through these two very informative posts:
https://gatkforums.broadinstitute.org/gatk/discussion/6472/read-groups
Read Group In Sam/Bam Files: What Do They Exactly Describe?
Accordingly, I am trying to define the read groups for each bam file, as follows.
- ID: flowcell ID and lane ID (e.g. HNTW5BBXX_1)
- SM: the name of the sample (e.g. samp31)
- PL: ILLUMINA
- LB: lib_samp31
- PI: insert size (e.g. 200)
- PU: flowcell ID and lane ID and sample ID (e.g. HNTW5BBXX_1_samp31)
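For illustration, the scheme listed above could be assembled into a SAM `@RG` header line roughly as follows. This is only a sketch of the naming convention described in the question: the function names `read_group_fields` and `rg_header_line` are hypothetical, and the `lib_<sample>` library name simply mirrors the example given for LB.

```python
def read_group_fields(flowcell, lane, sample, insert_size=None):
    """Build read-group tags following the scheme in the question (assumed naming)."""
    unit = f"{flowcell}_{lane}"          # flowcell ID + lane ID
    fields = {
        "ID": unit,                      # e.g. HNTW5BBXX_1
        "SM": sample,                    # e.g. samp31
        "PL": "ILLUMINA",
        "LB": f"lib_{sample}",           # one library per sample (assumption)
        "PU": f"{unit}_{sample}",        # flowcell + lane + sample
    }
    if insert_size is not None:
        fields["PI"] = str(insert_size)  # optional predicted insert size
    return fields


def rg_header_line(fields):
    """Format the tags as a tab-separated @RG header line."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields.items())


print(rg_header_line(read_group_fields("HNTW5BBXX", 1, "samp31", 200)))
```

Generating the tags from the flowcell, lane and sample names keeps the 540 read groups consistent, which is easier than typing them by hand.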
I would like to clarify the following:
- Did I get something wrong interpreting the fields?
- Could I exclude PU, since it is not required by GATK according to the link above? Do you usually include it anyway?
Thanks in advance!
Unless you have QC reasons to say that a lane did poorly, you should concatenate all 9 lanes together for each sample. Keeping them separate is doing you no favors. Merge the bams now before you do more.
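As a rough sketch of what merging the per-lane bams for one sample might look like, the snippet below only builds a `samtools merge` command line. The file naming (`samp31_L1.bam` ... `samp31_L9.bam`, `samp31.merged.bam`) and the helper name `merge_command` are hypothetical, and it assumes the per-lane bams are already coordinate-sorted and that samtools is installed.

```python
def merge_command(sample, lanes=range(1, 10)):
    """Return a samtools merge command for one sample's per-lane bams (assumed names)."""
    inputs = [f"{sample}_L{lane}.bam" for lane in lanes]  # hypothetical file names
    # Positional form: samtools merge <out.bam> <in1.bam> <in2.bam> ...
    return ["samtools", "merge", f"{sample}.merged.bam", *inputs]


cmd = merge_command("samp31")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```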
I read here that keeping bams separated during pre-processing is reasonable. And also, the way I understood it, for each sample, every bam file corresponds to a different read group, as they are derived from reads produced by different lanes.
Five-year-old recommendations are no longer relevant; just concatenate the lanes together.
So then the read groups should be as follows:
Not sure about keeping PI and PU now...
Correct?