Dealing with multiple read groups, from BWA to sorted .bam

6.3 years ago

jamiedm • 0

Hi all,

My (paired-end) sequencing data consists of 5 samples, each run on 3 different lanes (6, 7, 8):

 Sample1_L6_R1.fastq.gz,     Sample1_L6_R2.fastq.gz
 Sample1_L7_R1.fastq.gz,     Sample1_L7_R2.fastq.gz
 Sample1_L8_R1.fastq.gz,     Sample1_L8_R2.fastq.gz
...
 Sample5_L6_R1.fastq.gz,     Sample5_L6_R2.fastq.gz
 Sample5_L7_R1.fastq.gz,     Sample5_L7_R2.fastq.gz
 Sample5_L8_R1.fastq.gz,     Sample5_L8_R2.fastq.gz

My understanding is that since each sample was run on multiple lanes, it is important to specify read groups so that downstream applications such as GATK can distinguish them. For this reason, I align each lane's files separately, adding read group information like this (for Sample1_L6):

bwa mem -t #threadno.# -R "@RG\tID:S1L6\tSM:S1\tPL:ILLUMINA\tLB:FC-140-1086" /path/to/hg38ref.fa /path/to/Sample1Lane6Read1.fq /path/to/Sample1Lane6Read2.fq > S1L6_alignment.sam
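The same read-group pattern can be generated per lane instead of typed by hand. This is only a sketch: `rg_string` and `print_align_cmds` are made-up helper names, the `LB` value and the `ID:S<sample>L<lane>` scheme are copied from the command above, the output names follow the `*_aligned.sam` pattern used later in the post, and `-t` is left out for brevity. It prints the commands rather than running them, so they can be checked first.

```shell
# Build the @RG line for one sample/lane pair (hypothetical helper, not part of bwa).
# The literal "\t" sequences are kept as-is; bwa mem converts them to tabs itself.
rg_string() {
    local n=$1 lane=$2
    printf '%s' "@RG\\tID:S${n}L${lane}\\tSM:S${n}\\tPL:ILLUMINA\\tLB:FC-140-1086"
}

# Print (do not execute) one bwa mem command per lane for sample number $1.
print_align_cmds() {
    local n=$1 lane prefix
    for lane in 6 7 8; do
        prefix="Sample${n}_L${lane}"
        # printf '%s\n' is used instead of echo so the "\t" text is never expanded
        printf '%s\n' "bwa mem -R '$(rg_string "$n" "$lane")' /path/to/hg38ref.fa ${prefix}_R1.fastq.gz ${prefix}_R2.fastq.gz > ${prefix}_aligned.sam"
    done
}

print_align_cmds 1
```

All three lanes of a sample share the same `SM` tag and differ only in `ID`, which is what lets downstream tools treat them as one sample with three read groups.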

Naturally, I want to go from here to sorted .bam files (one per sample: Sample1_sorted.bam ... Sample5_sorted.bam), so I can then remove duplicates and proceed with downstream analysis.

My question is, what would be the 'best' way to go from three unsorted .sam files to a sorted .bam file with read groups intact (preferably with Samtools)? By 'intact' I mean that each Samplex.bam would contain three different read groups corresponding to the lanes.
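Whether the read groups are "intact" in this sense can be checked from the header: a per-sample file should show one `@RG` line per lane. A small sketch follows; `count_read_groups` is a made-up helper name, and the header fed to it here is hand-written for illustration rather than taken from real output.

```shell
# Count @RG lines in a SAM header read from stdin (hypothetical helper).
count_read_groups() {
    grep -c '^@RG'
}

# Real use (assumes samtools is installed and the BAM exists):
#   samtools view -H Sample1_sorted.bam | count_read_groups

# Self-contained demonstration on a hand-written header with three lanes:
printf '@HD\tVN:1.6\tSO:coordinate\n@RG\tID:S1L6\tSM:S1\n@RG\tID:S1L7\tSM:S1\n@RG\tID:S1L8\tSM:S1\n' | count_read_groups
```

For the demonstration header this prints 3; a merged three-lane BAM with read groups preserved should give the same count.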

I presume that samtools view -b, samtools sort, and samtools merge/cat would be the tools I need, but in which order?

I originally tried merging and converting in one step like this:

samtools merge Sample1_unsorted.bam Sample1_L6_aligned.sam Sample1_L7_aligned.sam Sample1_L8_aligned.sam

I'm unsure if this is a valid use of the tools, and I think I read somewhere that samtools merge should be run on sorted files anyway.
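On the ordering point: samtools merge is documented as merging already-sorted inputs, so one workable order is to sort each lane first and merge afterwards. A rough sketch, assuming samtools 1.x (where `sort` takes `-o` and reads SAM directly) and the `*_aligned.sam` names from the merge attempt above. The `run` wrapper and `DRY_RUN` variable are illustrative helpers, not samtools features; by default the commands are only printed.

```shell
# Set DRY_RUN=0 to execute the commands; by default they are only printed.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Sort each lane's SAM into a coordinate-sorted BAM, then merge the three.
sort_and_merge() {
    local sample=$1 lane sorted=""
    for lane in 6 7 8; do
        run samtools sort -o "${sample}_L${lane}_sorted.bam" "${sample}_L${lane}_aligned.sam"
        sorted="$sorted ${sample}_L${lane}_sorted.bam"
    done
    # $sorted is deliberately unquoted so it splits into three file arguments
    run samtools merge "${sample}_sorted.bam" $sorted
    run samtools index "${sample}_sorted.bam"
}

sort_and_merge Sample1
```

Since each lane has a distinct `@RG` ID, the merged header should carry all three read groups, though that is worth confirming on the real output before moving on to duplicate handling.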

Any help or advice would be hugely appreciated!

DNA-seq sequence • 4.2k views

This article from GATK might provide some helpful information:

How should I pre-process data from multiplexed sequencing and multi-library designs?

ADD REPLY • link 6.3 years ago by Russ ▴ 500