Question

How to properly fill in @RG (read group) information? Example?

7

10.3 years ago

Carlos Borroto ★ 2.1k

Hi,

I searched for the definitive way to fill in @RG information in a BAM file. This is what I have so far. Please see the questions I still have at the end.

Let's say I have two biological samples (SAMPLE01 and SAMPLE02). A library was built for each sample and sequenced twice, multiplexed, on a MiSeq instrument. I then decided I wanted an extra sequencing run, but after building a new set of libraries. This is what I think the proper @RG lines should be.

Run 1 using libraries 01:

@RG ID:SAMPLE01.R01 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01
@RG ID:SAMPLE02.R01 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L01

Run 2 using libraries 01:

@RG ID:SAMPLE01.R02 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L01
@RG ID:SAMPLE02.R02 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L01

Run 3 using libraries 02:

@RG ID:SAMPLE01.R03 SM:SAMPLE01 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE01.L02
@RG ID:SAMPLE02.R03 SM:SAMPLE02 PL:ILLUMINA PU:000000000-A6NRF:1 LB:SAMPLE02.L02
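
For anyone who wants to generate lines like these rather than type them, here is a minimal Python sketch of the naming scheme above. The function names (`format_rg_line`, `read_group_for`) are made up for illustration, and it assumes the SAMPLE.Rnn / SAMPLE.Lnn / FLOWCELL:LANE conventions from this post. Note that in a real SAM header the @RG fields are separated by tabs, not spaces.

```python
def format_rg_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build one @RG header line. SAM header fields are tab-separated."""
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",
        f"PL:{platform}",
        f"PU:{platform_unit}",
        f"LB:{library}",
    ]
    return "\t".join(fields)


def read_group_for(sample, run_number, library_number, flowcell, lane):
    """Apply the naming scheme from the post (an assumption, not a standard)."""
    rg_id = f"{sample}.R{run_number:02d}"          # e.g. SAMPLE01.R01
    library = f"{sample}.L{library_number:02d}"    # e.g. SAMPLE01.L01
    platform_unit = f"{flowcell}:{lane}"           # FLOWCELL_ID:LANE_NUMBER
    return format_rg_line(rg_id, sample, library, platform_unit)


if __name__ == "__main__":
    # Run 1, libraries 01, both samples multiplexed in lane 1.
    for sample in ("SAMPLE01", "SAMPLE02"):
        print(read_group_for(sample, 1, 1, "000000000-A6NRF", 1))
```

Whether the run number should also go into PU is exactly the open question below; this sketch leaves PU as flowcell and lane only.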

My question is: did I get PU right? I'm using FLOWCELL_ID:LANE_NUMBER here. Should PU be unique among runs? Should I include the instrument run number to make it unique?

I guess I should also ask: can this be improved, or is this all that GATK needs?

Thanks, Carlos

gatk picard • 10.0k views

ADD COMMENT • link 10.2 years ago by Carlos Borroto ★ 2.1k

3

I think the RG is stored as a string for each read. So the shorter your ID is (ID:1, ID:2), the smaller your BAM will be.
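
To put a rough number on that, here is a small Python sketch. It assumes the BAM encoding of a string auxiliary field: a 2-byte tag name, a 1-byte type code, then the NUL-terminated string. The function name `rg_tag_bytes` is made up for illustration. The figures are before BGZF compression, which usually shrinks highly repetitive strings a lot, so the on-disk difference should be smaller.

```python
def rg_tag_bytes(rg_id):
    """Uncompressed bytes used by one RG:Z:<id> field on a single BAM record."""
    # 2 bytes for the tag name "RG", 1 byte for the type code "Z",
    # the ID itself, and 1 byte for the terminating NUL.
    return 2 + 1 + len(rg_id.encode("ascii")) + 1


if __name__ == "__main__":
    short_id, long_id = "1", "SAMPLE01.R01"
    extra_per_read = rg_tag_bytes(long_id) - rg_tag_bytes(short_id)
    reads = 1_000_000
    print(f"{extra_per_read} extra bytes per read, "
          f"{extra_per_read * reads / 1e6:.0f} MB per million reads (uncompressed)")
```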

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 161k

0

That's true, but I'm working with what is going to be a very large data set. We already track samples with unique IDs, so coming up with a separate unique ID for RGs seems redundant. And with the size of our dataset, even if I number them serially starting at 1, the IDs will get long at some point. I feel it is cleaner to have an easy translation from RG.ID to SAMPLE.ID.

ADD REPLY • link 10.3 years ago by Carlos Borroto ★ 2.1k

1

Everything seems fine to me. Most downstream analysis tools only use information from the LB, SM and RG ID tags and, in a very few cases, PL too. I have never seen any tool use the PU tag, but it's always good to have all the tags filled in properly.

ADD REPLY • link 10.3 years ago by Ashutosh Pandey 12k

0

That's interesting, because in a video from the Broad (sorry, I don't have the link handy) explaining the importance of properly filling in RGs, they specifically mention modeling lane errors as an example. I would guess that GATK tells which reads come from the same lane by using PU. This is also why I think PU might need to be unique among runs.

Am I wrong?

ADD REPLY • link 10.3 years ago by Carlos Borroto ★ 2.1k

0

I think GATK uses read groups to determine whether reads come from the same lane. Reads belonging to the same lane should have the same RG ID.

ADD REPLY • link 10.3 years ago by Ashutosh Pandey 12k

0

That's not my understanding. I think each RG ID needs to be unique.

ADD REPLY • link 10.3 years ago by Carlos Borroto ★ 2.1k

0

Yes, RG IDs should be unique. A sequencer can run a sample in different lanes. Let's assume you ran a library in lane 1, then reran the same library in the same lane or in another lane. Read names (the read header numbers) are assigned from a finite set, so some reads from the first run may have the same read IDs as reads from the second run. Similarly, runs in different lanes can produce reads with the same read IDs. If you merge these two FASTQ files and align them, or merge the individual BAM files, you may get an error saying that a read ID already exists. So you need to provide unique RG IDs for the two runs, or for reads belonging to different lanes. That was the primary purpose of RG IDs, but GATK now also uses them for BQSR.

ADD REPLY • link 10.3 years ago by Ashutosh Pandey 12k

0

I'm sorry, I can't completely follow your reasoning. But one thing I'm sure you got wrong is that there will never be two reads with the same read ID. See this link: http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers

The read IDs include the run number of the instrument, a unique instrument identifier, the lane number and the position within the lane. This combination makes read IDs unique even if you merge FASTQ files from different runs or even different instruments.

Also, RG IDs only identify distinct combinations of RG tags. Every time any of the tags changes, you should use a different RG ID. This means that if you run two samples in the same lane, you have two SM tags and therefore need two RG IDs. You now have two RG IDs for two groups run in the same lane, so RG IDs cannot be used to identify reads coming from the same lane.

I hope I explained myself better. I'll keep looking around. Maybe I should ask in the GATK forum.

Thanks for helping me understand this issue. Carlos
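
The "one RG ID per distinct combination of tags" rule can be checked mechanically. Here is a minimal Python sketch, assuming tab-separated @RG header lines as in a real SAM header. The function names `parse_rg_line` and `find_rg_conflicts` are made up for illustration. It reports any RG ID that is reused for a different set of tag values.

```python
def parse_rg_line(line):
    """Split one tab-separated @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError(f"not an @RG line: {line!r}")
    # Split on the first ':' only, because values such as PU may contain ':'.
    return dict(field.split(":", 1) for field in fields[1:])


def find_rg_conflicts(rg_lines):
    """Return the sorted RG IDs that appear with more than one tag combination."""
    seen = {}
    conflicts = set()
    for line in rg_lines:
        tags = parse_rg_line(line)
        rg_id = tags["ID"]
        if rg_id in seen and seen[rg_id] != tags:
            conflicts.add(rg_id)
        seen.setdefault(rg_id, tags)
    return sorted(conflicts)


if __name__ == "__main__":
    # Two samples multiplexed in one lane: same PU, different SM, so two IDs.
    lines = [
        "@RG\tID:SAMPLE01.R01\tSM:SAMPLE01\tPU:000000000-A6NRF:1",
        "@RG\tID:SAMPLE02.R01\tSM:SAMPLE02\tPU:000000000-A6NRF:1",
    ]
    print(find_rg_conflicts(lines))
```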

ADD REPLY • link 10.3 years ago by Carlos Borroto ★ 2.1k

0

After reading all the comments in this thread, I'm not sure whether any downstream tools use LB. Which ones?

Second, if we run a sample in 2 lanes, should it have the same ID and a different LB? And again, what downstream tools would use that info?

ADD REPLY • link 10.2 years ago by brentp 24k

1

This is what I understand so far. LB is used to mark duplicates. Each sample should be labeled with SM and belong to a different @RG ID. I have confirmation that @RG ID is used in BQSR.

If you run a sample in 2 lanes, you should have 2 @RG lines. They should both have the same LB if they come from the same library preparation, which is almost always the case. It would look like this:

@RG ID:1.1 SM:1 LB:1
@RG ID:1.2 SM:1 LB:1

ADD REPLY • link 10.2 years ago by Carlos Borroto ★ 2.1k

0

I found point 9 on this page to be very helpful: http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files

ADD REPLY • link 10.1 years ago by brentp 24k

0

The interesting part is that they keep saying '@RG ID' should be set to something like FLOWCELL.ID, to make it unique among all sequencing runs in the world and to mark all reads coming out of the same lane for the BQSR error model. The problem is that you cannot do this if you are multiplexing several samples in one lane. I can't imagine why the GATK team overlooked that case.

ADD REPLY • link 10.1 years ago by Carlos Borroto ★ 2.1k

score 1 · Answer 1 · 2014-01-31

I got my question answered by Appistry. They provided commercial support for GATK in close collaboration with the Broad.

Appistry confirmed that GATK does use @RG ID to tell which reads come from the same lane. @ashutoshmits was correct in his comment above. However, this does mean there is no way to mark reads from several samples in a multiplexed run as coming from the same lane. Appistry support mentioned that you still get better results from the error models and the different covariates BQSR uses by running GATK with as much data as possible from the same lane, even if there is no way to tell this from the @RG tags.

I'm still surprised GATK didn't use PU instead. I think that would be the perfect tag to avoid this situation. Picard already made PU required in their AddOrReplaceReadGroups tool. GATK however does not require PU.

--Carlos

score 0 · Answer 2 · 2014-01-09

0

10.3 years ago

Mitch Bekritsky ★ 1.3k

According to the SAM format specification:

PU: Platform unit (e.g., flowcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.

I see you linked the SAM specification in your question. Have you gone over it yet?

ADD COMMENT • link 10.3 years ago by Mitch Bekritsky ★ 1.3k

0

Well, I didn't link the spec. It seems biostar.org did that for me.

But yes, I did look into the spec, and that was the reason I used the flowcell:lane syntax. Do you know if the flowcell ID I see in the FASTQ read header is unique for the instrument or for the run?

This is the kind of read ID I'm seeing: @M00941:81:000000000-A5NM7:1:1101:14552:1574

Here the flowcell id is "000000000-A5NM7" the lane number is "1".
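
Pulling those two fields out of the read name can be scripted. This is a minimal Python sketch that assumes the seven colon-separated fields seen in the read ID above (instrument, run number, flowcell, lane, tile, x, y). The function name `platform_unit_from_read_id` is made up for illustration, and older Illumina header layouts would need different handling.

```python
def platform_unit_from_read_id(read_id):
    """Return 'FLOWCELL:LANE' from an Illumina read name with 7 ':' fields."""
    tokens = read_id.split()
    if not tokens:
        raise ValueError("empty read ID")
    # Keep only the name part (before any whitespace) and drop the leading '@'.
    name = tokens[0].lstrip("@")
    parts = name.split(":")
    if len(parts) != 7:
        raise ValueError(f"unexpected read ID layout: {read_id!r}")
    instrument, run_number, flowcell, lane, tile, x, y = parts
    return f"{flowcell}:{lane}"


if __name__ == "__main__":
    print(platform_unit_from_read_id("@M00941:81:000000000-A5NM7:1:1101:14552:1574"))
```

The run number (81 here) is available from the same split if it turns out PU needs it to be unique across runs.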

ADD REPLY • link 10.3 years ago by Carlos Borroto ★ 2.1k

0

Ah...sneaky! I'm pretty sure the read header is unique to the run. For instance, reads coming from the sequencing facility on my campus have headers looking something like (instrument name):(some number):(flowcell name):(lane):(other numbers). Do you have any file or LIMS that associates each sequencing file with a particular flowcell? That may be the easiest way to generate a PU tag.

ADD REPLY • link 10.3 years ago by Mitch Bekritsky ★ 1.3k

0

I know the read IDs are unique; I was asking about just the flowcell ID.

We are hiring a lab to do the sequencing for us, and my role is to make sure they provide BAM files with the proper information. It's not clear to me how they are going to do it. I would guess they do have a LIMS. They are a pretty big lab.

ADD REPLY • link 10.3 years ago by Carlos Borroto ★ 2.1k

0

I would imagine the flowcell ID is unique as well. They are in the sequencing facility on my campus, but I would check with the sequencing lab you'll be working with.

ADD REPLY • link 10.3 years ago by Mitch Bekritsky ★ 1.3k