Hi, I'm new to analyzing whole exome sequencing data so my question may be quite naive.
I have some bam files to analyze, and my ultimate goal is to run GATK's germline short variant workflow. As recommended by GATK, the first step should be to make sure the bam files are "analysis-ready".
So my 1st question is, how can I tell whether the bam files that I got are "analysis-ready"?
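One rough way to approach this is to look at the bam header (for example the output of `samtools view -H`) and see whether it records coordinate sorting, read groups, and `@PG` entries for duplicate marking and base recalibration. Below is a minimal sketch of that idea. The function name `header_hints` and the sample header are made up for illustration, and it assumes each tool wrote itself into an `@PG` line, which is not guaranteed. Treat the result as a hint, not proof.

```python
def header_hints(header_text):
    """Return rough pre-processing hints from SAM/BAM header text.

    header_text is the plain-text header, e.g. from `samtools view -H`.
    A missing hint does not prove a step was skipped; tools are not
    required to leave an @PG record.
    """
    lines = header_text.splitlines()
    hd_lines = [l for l in lines if l.startswith("@HD")]
    pg_lines = [l.lower() for l in lines if l.startswith("@PG")]
    return {
        # @HD SO:coordinate means the file declares coordinate sorting
        "coordinate_sorted": any("SO:coordinate" in l.split("\t") for l in hd_lines),
        # @RG lines carry the read group / sample information
        "has_read_groups": any(l.startswith("@RG") for l in lines),
        # Substring matches on @PG lines (an assumption about how tools log themselves)
        "markduplicates_in_pg": any("markduplicates" in l for l in pg_lines),
        "bqsr_in_pg": any("applybqsr" in l for l in pg_lines),
    }


# Hypothetical header, only to show the call
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@RG\tID:rg1\tSM:sample1\n"
    "@PG\tID:MarkDuplicates\tPN:MarkDuplicates\n"
)
for name, value in header_hints(example_header).items():
    print(name, value)
```

If any of the hints come back false, re-running the pre-processing workflow is the safer choice.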
To be safe, I ran GATK's data pre-processing workflow on the bam files that I have. Then I used samtools flagstat to check the results:
For the original bam file (file size: 14 GB):
88656828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
15067704 + 0 duplicates
86814510 + 0 mapped (97.92% : N/A)
88656828 + 0 paired in sequencing
44328414 + 0 read1
44328414 + 0 read2
85770856 + 0 properly paired (96.74% : N/A)
86046282 + 0 with itself and mate mapped
768228 + 0 singletons (0.87% : N/A)
131052 + 0 with mate mapped to a different chr
63913 + 0 with mate mapped to a different chr (mapQ>=5)
For the GATK-processed bam file (file size: 16 GB, which is 2 GB larger):
88861074 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
204246 + 0 supplementary
14042391 + 0 duplicates
88661994 + 0 mapped (99.78% : N/A)
88656828 + 0 paired in sequencing
44328414 + 0 read1
44328414 + 0 read2
87122220 + 0 properly paired (98.27% : N/A)
88327966 + 0 with itself and mate mapped
129782 + 0 singletons (0.15% : N/A)
294904 + 0 with mate mapped to a different chr
186027 + 0 with mate mapped to a different chr (mapQ>=5)
Comparing the results (which may also help answer my 1st question), they are not exactly the same. So my 2nd question is, which bam file should I use for later processing?
Thank you very much for your help!
Never use file size as a QC criterion. If you are following the GATK recommendations, then stick with them.
Not sure why your total read number is slightly different.
The number of reads paired in sequencing is the same, and the processed bam has 204246 supplementary reads (this number is the difference in the total reads), whereas the original has 0.
Please help! :( Thank you!
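The arithmetic on the totals can be checked by parsing the two flagstat reports and subtracting the counts. Here is a minimal sketch, with `parse_flagstat` as a made-up helper name and only three lines copied from each report in the question. It assumes the classic `N + M label` flagstat text layout.

```python
import re


def parse_flagstat(text):
    """Map each flagstat category label to its QC-passed count."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+) \+ (\d+) (.+)", line)
        if not m:
            continue
        # Drop trailing notes like "(QC-passed reads + ...)" or "(97.92% : N/A)",
        # but keep qualifiers such as "(mapQ>=5)" that distinguish categories.
        label = re.sub(r"\s*\((?:QC-passed|[\d.]+%|N/A).*$", "", m.group(3)).strip()
        counts[label] = int(m.group(1))
    return counts


original = """88656828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 supplementary
88656828 + 0 paired in sequencing"""

processed = """88861074 + 0 in total (QC-passed reads + QC-failed reads)
204246 + 0 supplementary
88656828 + 0 paired in sequencing"""

a, b = parse_flagstat(original), parse_flagstat(processed)
extra_total = b["in total"] - a["in total"]
extra_supplementary = b["supplementary"] - a["supplementary"]
print("extra records in total:", extra_total)
print("extra supplementary:", extra_supplementary)
print("same reads paired in sequencing:",
      a["paired in sequencing"] == b["paired in sequencing"])
```

Supplementary records are additional alignment lines for reads that were split across locations, so they raise the flagstat total without adding new sequenced reads. That is consistent with the paired-in-sequencing count being identical in both files.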