Questions about LB field in SAM specification for PCR duplicate removal
7.0 years ago
James Ashmore ★ 3.4k

Should the LB field in the SAM specification refer to the library preparation for the sample, or to the library preparation carried out by the sequencing centre? Say I have a sample sequenced on multiple lanes of a single flowcell/machine: should the reads from every lane have the same library name? And what if a sample was sequenced on one lane/flowcell/machine on a certain date, and then sequenced again on a different lane/flowcell/machine? Would the reads from these two runs have the same library name?

My question arises because normally, when I want to remove duplicates from multiplexed samples (all sequenced on the same machine/date), I just align the FASTQ files separately, then merge the BAM files belonging to the same sample and run MarkDuplicates on the merged BAM. However, I recently asked on the GATK forum whether read group information is necessary in this context, and the answer was yes (http://gatkforums.broadinstitute.org/gatk/discussion/9310/read-group-information-required-for-markduplicates).

This confused me: if the sample was produced from a single library, shouldn't merging and then removing duplicates based on the 5' position alone remove all duplicates (optical and library)?

sam markduplicates • 2.1k views

I have faced a similar problem in the past. As far as I know, MarkDuplicates looks for duplicates among reads that belong to the same library, which it takes from the LB tag of each read's read group (RG). All the data from the same library should therefore carry the same LB value in the RG. However, when you analyse your data in pieces, you may find at the end that the RG field does not reflect the correct information. There are different ways to solve this. For example, if you are aligning with bwa, you can ask it to write a proper, pre-specified RG line (the -R option of bwa mem).
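As a rough illustration of the bwa suggestion, here is a minimal Python sketch that builds one read group line per lane and the matching `bwa mem -R` argument list. The sample, library, lane and file names are all hypothetical, and the choice of ID scheme (sample plus lane) is just one possible convention. The point is that every lane gets its own ID while sharing the same LB.

```python
def read_group_line(rg_id, sample, library, platform="ILLUMINA"):
    """Build an @RG line in the form accepted by `bwa mem -R`.

    bwa expects the tab separators written as the two characters
    backslash + t, so they are escaped here rather than real tabs.
    """
    fields = [("ID", rg_id), ("SM", sample), ("LB", library), ("PL", platform)]
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)


def bwa_mem_command(reference, fastq1, fastq2, rg_line):
    """Return the argument list for a paired-end bwa mem run (not executed)."""
    return ["bwa", "mem", "-R", rg_line, reference, fastq1, fastq2]


# Hypothetical example: one sample, one library, two lanes.
sample, library = "sample1", "lib1"
commands = []
for lane in ["L001", "L002"]:
    rg = read_group_line(f"{sample}.{lane}", sample, library)
    commands.append(
        bwa_mem_command("ref.fa", f"{sample}_{lane}_R1.fq.gz", f"{sample}_{lane}_R2.fq.gz", rg)
    )

for cmd in commands:
    print(" ".join(cmd))
```

When the printed command is pasted into a shell, the read group string needs to be quoted so that the backslashes reach bwa unchanged.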

7.0 years ago

My question arises because normally when I want to remove duplicates from multiplexed samples (all sequenced on the same machine/date) I just align the FASTQ files separately, then merge BAM files belonging to the same sample and run MarkDuplicates

Each distinct sample/lane/library combination should be given a different read group ID in the SAM header.


How do you define a library?


The library is the DNA library, i.e. the preparation of DNA in which PCR duplicates arise. It doesn't matter if you run it on different lanes: you should treat all the reads from a library together when marking duplicates.
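To make the "treat all reads from a library together" idea concrete, here is a toy Python sketch. It is not how MarkDuplicates is implemented; it only assumes that duplicates are reads sharing library, chromosome, 5' position and strand, and that the highest-scoring read in each group is kept. The `Read` type, the read group names and the scores are all made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    read_group: str      # RG ID, e.g. one per lane
    chrom: str
    five_prime_pos: int
    strand: str
    score: int           # stand-in for a quality score used to pick the read to keep


# Hypothetical header information: two lanes of lib1, plus a second library.
RG_TO_LIBRARY = {"s1.L001": "lib1", "s1.L002": "lib1", "s1.rerun": "lib2"}


def mark_duplicates(reads, rg_to_library):
    """Return names of reads flagged as duplicates, grouping by library, not by lane."""
    groups = defaultdict(list)
    for read in reads:
        key = (rg_to_library[read.read_group], read.chrom, read.five_prime_pos, read.strand)
        groups[key].append(read)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda r: r.score)  # keep the best-scoring read
        duplicates.update(r.name for r in members if r is not best)
    return duplicates


reads = [
    Read("a", "s1.L001", "chr1", 100, "+", 30),
    Read("b", "s1.L002", "chr1", 100, "+", 40),   # same library as "a", different lane
    Read("c", "s1.rerun", "chr1", 100, "+", 20),  # same position, different library
    Read("d", "s1.L001", "chr1", 200, "+", 30),
]
print(mark_duplicates(reads, RG_TO_LIBRARY))
```

In this toy data, reads "a" and "b" come from different lanes but the same library, so one of them is flagged; read "c" sits at the same position but belongs to a different library, so it is left alone.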
