Dealing with multiple read groups, from BWA to sorted .bam
6.3 years ago
jamiedm • 0

Hi all,

My (paired-end) sequencing data consists of 5 samples, each run on 3 different lanes (6, 7, 8):

 Sample1_L6_R1.fastq.gz,     Sample1_L6_R2.fastq.gz
 Sample1_L7_R1.fastq.gz,     Sample1_L7_R2.fastq.gz
 Sample1_L8_R1.fastq.gz,     Sample1_L8_R2.fastq.gz
...
 Sample5_L6_R1.fastq.gz,     Sample5_L6_R2.fastq.gz
 Sample5_L7_R1.fastq.gz,     Sample5_L7_R2.fastq.gz
 Sample5_L8_R1.fastq.gz,     Sample5_L8_R2.fastq.gz

My understanding is that since each sample was run on multiple lanes, it is important to specify read groups so that downstream applications such as GATK can tell the lanes apart. For this reason, I align each lane's pair of files separately, adding read group information like this (for Sample1_L6):

bwa mem -t #threadno.# -R "@RG\tID:S1L6\tSM:S1\tPL:ILLUMINA\tLB:FC-140-1086" /path/to/hg38ref.fa /path/to/Sample1Lane6Read1.fq /path/to/Sample1Lane6Read2.fq > S1L6_alignment.sam
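
Repeated for every lane of a sample, that comes down to something like the sketch below. It only prints the commands rather than running them. The helper name make_rg is made up for illustration, the FASTQ names follow the file listing above, and reusing the same LB value for every lane is an assumption carried over from my single example.

```shell
#!/bin/sh
# Build the per-lane @RG string passed to `bwa mem -R`.
# Assumption: ID = sample + lane, SM = sample, LB as in the example above.
make_rg() {
    sample="$1"
    lane="$2"
    # '\\t' in the printf format emits a literal backslash-t, which bwa expands to a tab.
    printf '@RG\\tID:%sL%s\\tSM:%s\\tPL:ILLUMINA\\tLB:FC-140-1086' \
        "$sample" "$lane" "$sample"
}

for lane in 6 7 8; do
    rg=$(make_rg S1 "$lane")
    # Dry run: print the alignment command for this lane instead of executing it.
    printf 'bwa mem -R "%s" /path/to/hg38ref.fa Sample1_L%s_R1.fastq.gz Sample1_L%s_R2.fastq.gz > S1L%s_alignment.sam\n' \
        "$rg" "$lane" "$lane" "$lane"
done
```

Each lane gets its own ID while SM stays the same, which is what should let the three lanes be recognised as one sample later on.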

Naturally, I want to go from here to sorted .bam files (one for each sample like: Sample1_sorted.bam ... Sample5_sorted.bam), so I can then RemoveDuplicates and proceed with downstream analysis.

My question is: what would be the 'best' way to go from three unsorted .sam files to a single sorted .bam file with the read groups intact (preferably with Samtools)? By 'intact' I mean that each SampleX.bam would contain three different read groups, one per lane.

I presume that samtools view -b, samtools sort, and samtools merge/cat would be the tools I need, but in which order?

I originally tried merging and converting in one step like this:

samtools merge Sample1_unsorted.bam Sample1_L6_aligned.sam Sample1_L7_aligned.sam Sample1_L8_aligned.sam

I'm unsure if this is a valid use of the tools, and I think I read somewhere that samtools merge should be run on sorted files anyway.
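
If that is right, one order that would fit is to sort each lane first and merge afterwards. Below is a minimal sketch of that idea, not a tested pipeline. It assumes a reasonably recent samtools (1.x), where samtools sort accepts SAM input and writes BAM when the -o name ends in .bam, and where samtools merge expects coordinate-sorted inputs and combines their @RG header lines. The run wrapper and the DRY_RUN variable are hypothetical helpers so that the commands are only printed by default.

```shell
#!/bin/sh
# Sort each lane's SAM into a coordinate-sorted BAM, then merge per sample.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

sample=Sample1
sorted=""
for lane in 6 7 8; do
    # SAM in, sorted BAM out (no separate `samtools view -b` step needed).
    run samtools sort -o "${sample}_L${lane}_sorted.bam" "${sample}_L${lane}_aligned.sam"
    sorted="$sorted ${sample}_L${lane}_sorted.bam"
done

# Merge the sorted lane files; $sorted is left unquoted on purpose so it
# splits into one argument per file.
run samtools merge "${sample}_sorted.bam" $sorted
run samtools index "${sample}_sorted.bam"

# Print the merged header; the three lanes should show up as three @RG lines.
run samtools view -H "${sample}_sorted.bam"
```

With DRY_RUN=0, the last command is the check I would use to confirm that the read groups survived the merge.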

Any help or advice would be hugely appreciated!

DNA-seq sequence • 4.2k views

This article from GATK might provide some helpful information:

How should I pre-process data from multiplexed sequencing and multi-library designs?
