Question

Picard AddOrReplaceReadGroups Results In Smaller File

2

11.8 years ago

DG 7.3k

Hi Everyone,

I have recently started doing mapping and variant calling on whole-exome sequencing data from six different individuals. I have already mapped to the reference and converted the SAM files to BAM using Picard. I then added read group data to each file with AddOrReplaceReadGroups and took the opportunity to sort by coordinate at the same time. However, since I started with BAM files and only added data, I am a little puzzled that the resulting files are smaller: each one is about 3-4 GB smaller than its input. Is this normal, or should I be worried? An example command line was:

java -Xmx2g -jar /usr/local/bin/AddOrReplaceReadGroups.jar INPUT=1804.bam OUTPUT=1804.sorted.bam SORT_ORDER=coordinate RGLB=8 RGPL=Illumina RGPU=1 RGSM=1804

Thanks everyone

bam exome-sequencing picard • 6.9k views

updated 14 months ago by Ram 43k • written 11.8 years ago by DG 7.3k

score 8 · Answer 1 · 2012-07-18

8

11.8 years ago

brentp 24k

What is the original file size? Sorting should help compression because reads from the same region, which tend to be similar, end up close together. You can always check the number of mapped reads (-F 4 excludes unmapped reads) by doing something like:

samtools view -F 4 -c 1804.sorted.bam
samtools view -F 4 -c 1804.bam

and you should get the same count for both files.
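
If you have several samples to check, the comparison can be wrapped in a small helper. This is only a sketch: the function names `count_mapped` and `same_read_count` are made up for illustration, and it assumes `samtools` is on your PATH.

```shell
#!/usr/bin/env bash

# Count mapped reads in one BAM file.
# -c prints only the count; -F 4 skips reads flagged as unmapped.
count_mapped() {
    samtools view -c -F 4 "$1"
}

# Print the mapped-read count of two BAM files.
# Returns 0 if the counts match, 1 if they differ, 2 if samtools failed.
same_read_count() {
    local before after
    before=$(count_mapped "$1") || return 2
    after=$(count_mapped "$2") || return 2
    echo "$1: $before"
    echo "$2: $after"
    [ "$before" -eq "$after" ]
}
```

For example, `same_read_count 1804.bam 1804.sorted.bam && echo "read counts match"` would confirm that no mapped reads were lost during sorting.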


0

1804.bam: 15 GB
1804.sorted.bam: 11 GB

And you were right: both files have the same number of reads. Apparently the sorted BAM files just compress better, which isn't something I expected, but it makes perfect sense once I think about it.
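
The effect is easy to reproduce on toy data. The sketch below is an illustration only, not a model of real BAM files: it builds synthetic "reads" as substrings of a random sequence, then compresses them with `gzip` once in random order and once sorted by position. `gzip` stands in for BAM's BGZF compression here (both are DEFLATE-based); the sizes, the seed, and the helper names `make_reads` and `compressed_size` are arbitrary choices for the example.

```shell
#!/usr/bin/env bash

# Emit synthetic reads as "position<TAB>sequence" lines in random order.
# Each read is a 50-base substring of a random 200000-base "reference",
# so reads at nearby positions overlap and share sequence.
make_reads() {
    awk 'BEGIN {
        srand(42)                      # fixed seed so both runs see the same reads
        split("A C G T", base, " ")
        ref_len = 200000; n_reads = 20000; read_len = 50
        for (i = 1; i <= ref_len; i++) ref[i] = base[int(rand() * 4) + 1]
        for (r = 0; r < n_reads; r++) {
            pos = int(rand() * (ref_len - read_len)) + 1
            seq = ""
            for (j = 0; j < read_len; j++) seq = seq ref[pos + j]
            printf "%d\t%s\n", pos, seq
        }
    }'
}

# Compress stdin with gzip and print the compressed size in bytes.
compressed_size() {
    gzip -c | wc -c | tr -d ' '
}

unsorted_bytes=$(make_reads | compressed_size)
sorted_bytes=$(make_reads | sort -n | compressed_size)

echo "unsorted: $unsorted_bytes bytes"
echo "sorted:   $sorted_bytes bytes"
```

The sorted stream should come out smaller because overlapping reads sit next to each other, so the compressor finds repeated sequence within its short look-back window. In the unsorted stream the same overlaps exist but are usually too far apart to be reused.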