Hi Everyone,
I have recently started mapping and variant calling for six whole-exome sequencing projects (6 different individuals). I have already mapped the reads to the reference and converted the SAM files to BAM using Picard. I then added read group data to each file with AddOrReplaceReadGroups and took the opportunity to sort by coordinate at the same time. However, since this step adds data, I am a little puzzled that the resulting files are smaller than the BAM files I started with. Each file is about 3-4 GB smaller. Is this normal, or should I be worried? An example command line was:
java -Xmx2g -jar /usr/local/bin/AddOrReplaceReadGroups.jar INPUT=1804.bam OUTPUT=1804.sorted.bam SORT_ORDER=coordinate RGLB=8 RGPL=Illumina RGPU=1 RGSM=1804
Thanks everyone
1804.bam: 15 GB
1804.sorted.bam: 11 GB
And you were right: both files appear to contain the same number of reads. Apparently the sorted BAM files just compress better, which isn't something I quite expected, but it makes perfect sense once I think about it.
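For anyone who wants to see the effect without a BAM file at hand, here is a rough toy sketch. It is not a BAM workflow: plain gzip stands in for the BGZF compression that BAM uses, and the file names and the random "positions" are made up for illustration. The idea is that after coordinate sorting, neighboring records look alike (in a real BAM, adjacent reads cover the same region), so the compressor finds more repeats even though the record count is unchanged.

```shell
#!/bin/sh
# Toy demonstration only: identical records compress better once sorted.
# Assumption: plain gzip is used as a stand-in for BAM's BGZF compression.
set -e

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# 200,000 pseudo-random "positions" in arbitrary order (hypothetical data,
# loosely analogous to reads in an unsorted BAM).
awk 'BEGIN { srand(42); for (i = 0; i < 200000; i++) print int(rand() * 1000000) }' \
    > "$workdir/unsorted.txt"

# The same records, sorted by "coordinate".
sort -n "$workdir/unsorted.txt" > "$workdir/sorted.txt"

# Record counts should match: sorting only reorders the data.
unsorted_lines=$(wc -l < "$workdir/unsorted.txt" | tr -d ' ')
sorted_lines=$(wc -l < "$workdir/sorted.txt" | tr -d ' ')

# Compressed sizes in bytes.
unsorted_size=$(gzip -c "$workdir/unsorted.txt" | wc -c | tr -d ' ')
sorted_size=$(gzip -c "$workdir/sorted.txt" | wc -c | tr -d ' ')

echo "records:    unsorted=$unsorted_lines sorted=$sorted_lines"
echo "gzip bytes: unsorted=$unsorted_size sorted=$sorted_size"
```

The two record counts should be equal while the sorted file's compressed size comes out smaller, which is the same pattern as 1804.bam versus 1804.sorted.bam.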
That's an excellent explanation. Nice thinking.