STAR add RG
2
0
Entering edit mode
8.2 years ago
umn_bist ▴ 390

So I am transitioning into using STAR as our primary aligner for RNA-seq workflow and was curious if there's a way to add RG without knowing what it is beforehand - primarily because I am working with publicly available TCGA fastq files. These were not sequenced in-house.

I see --outSAMattrRGline and --outSAMreadID under STAR options but 1) the file needs to be in SAM/BAM format to extract RG (?) but our premapped files are in fastq format and 2) to add RGID I need to manually load (have prior knowledge of the RGID of our samples).

I looked up the filename format for TCGA files but it does not involve specific read group info. I am wondering if there's a way to extract RGID from fastq file and automatically input into the aligner for downstream analysis using GATK.

RNA-Seq STAR RGID GATK • 5.5k views
ADD COMMENT
2
Entering edit mode
8.2 years ago

You're over thinking this.

You already know the read group information for the TCGA samples, it's whatever you want it to be.

Also, fastq files don't contain meta-data.

ADD COMMENT
0
Entering edit mode

Oh okay, I see. Double checking - is it recommended/preferred to have the same RG for paired-end reads? I will ultimately have them merged into a single bam for variant calling downstream.

ADD REPLY
1
Entering edit mode

It is my understanding (although refer to Devon when he comes back) that the RGID is supposed to be unique to the lane/chip/sequencer as its the primary unit GATK uses to model noise in the data.
The samples ID (RGSM) needs to be the same for all samples you want to merge together and call SNPs on.

So my RGID is like "RG:Z:DBV2SVN1.D263EACXX.7.CTTGTA" in one bam,
"RG:Z:DBV2SVN1.D263EACXX.3.AATTCA" in another bam, (same chip, different lane and multiplex barcode),
but in BOTH bams the RGSM is "C57BL/6J" because they're both from the same isogenic mouse line.

When i merge them together, GATK will (hopefully) be able to call SNPs right... lets see :P

ADD REPLY
1
Entering edit mode

I'm in complete agreement with all of your comments :) For samples that you download you typically just use the SRA/ENA/whatever accession as the sample ID and library, unless the metadata online happens to suggest something more reasonable.

ADD REPLY
0
Entering edit mode

Since these files are single samples, I think I'm going to add an arbitrary RGID after merging paired-ends using picard (addorreplacereadgroup function). For variant calling (we're looking at rare, uncharacterized somatic mutations) is it preferred to merge all my samples into a single bam or call variants individually/separately?

ADD REPLY
2
Entering edit mode

Again, im not the authority on this - just someone who's read the docs a lot lately.
My understanding is that it's better to do joint calling than to do the calling individually and then look at overlapping (or all) SNPs called. To do joint calling however, you don't have to merge the BAM - you just have to make gVCFs for each BAM, and then run the joint caller on those gVCFs to 'merge' them at the variant level. However, as I found out the hard way, those gVCFs will have the sample ID in them, and they need to be the same for them to be joined.
So yeah, just give everything you want to treat as "the same" (genetically) the same sample ID, and you should be fine.

ADD REPLY
0
Entering edit mode

Awesome. The info will be very helpful. Thank you! Upvoted.

ADD REPLY
1
Entering edit mode

You should be aligning mates in a pair at the same time...

ADD REPLY
0
Entering edit mode
8.2 years ago
umn_bist ▴ 390

Deleted. See comment above.

ADD COMMENT

Login before adding your answer.

Traffic: 2718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6