How to tell whether a bam file is "analysis-ready" (as defined by GATK)?
4.9 years ago
minimax • 0

Hi, I'm new to analyzing whole exome sequencing data, so my question may be quite naive. I have some bam files to analyze, and my ultimate goal is to run GATK's germline short variant workflow. As recommended by GATK, the first step should be to make sure the bam files are "analysis-ready".

So my 1st question is: how can I tell whether the bam files that I got are "analysis-ready"?

To be safe, I performed GATK's data pre-processing workflow on the bam files that I have. Then I used samtools flagstat to check their results:

for the original bam file (file size: 14 GB):

88656828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
15067704 + 0 duplicates
86814510 + 0 mapped (97.92% : N/A)
88656828 + 0 paired in sequencing
44328414 + 0 read1
44328414 + 0 read2
85770856 + 0 properly paired (96.74% : N/A)
86046282 + 0 with itself and mate mapped
768228 + 0 singletons (0.87% : N/A)
131052 + 0 with mate mapped to a different chr
63913 + 0 with mate mapped to a different chr (mapQ>=5)

The GATK processed bam file (file size: 16 GB, which is 2 GB larger):

88861074 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
204246 + 0 supplementary
14042391 + 0 duplicates
88661994 + 0 mapped (99.78% : N/A)
88656828 + 0 paired in sequencing
44328414 + 0 read1
44328414 + 0 read2
87122220 + 0 properly paired (98.27% : N/A)
88327966 + 0 with itself and mate mapped
129782 + 0 singletons (0.15% : N/A)
294904 + 0 with mate mapped to a different chr
186027 + 0 with mate mapped to a different chr (mapQ>=5)

Comparing the results (which may also help for answering my 1st question), they are not exactly the same. So my 2nd question is, which bam file should I use for later processing?

Thank you very much for your help!

bam gatk bestpractice • 1.5k views

Never use file size as a QC criterion. If you are following the GATK recommendations, then stick with them.

Not sure why your total read number is slightly different.

The number of reads paired in sequencing is the same in both files. The processed bam has 204246 supplementary reads, whereas the original has 0, and this number is exactly the difference in the total read counts.
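
One way to double-check this arithmetic is to parse both flagstat outputs and compare the counts. The sketch below is only illustrative: the `parse_flagstat` helper and the variable names are made up for this example, and the numbers are pasted from the question above. It reads the QC-passed counts only.

```python
import re

# samtools flagstat output pasted from the question (original bam).
ORIGINAL = """\
88656828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
15067704 + 0 duplicates
86814510 + 0 mapped (97.92% : N/A)
88656828 + 0 paired in sequencing
85770856 + 0 properly paired (96.74% : N/A)
"""

# samtools flagstat output pasted from the question (GATK-processed bam).
PROCESSED = """\
88861074 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
204246 + 0 supplementary
14042391 + 0 duplicates
88661994 + 0 mapped (99.78% : N/A)
88656828 + 0 paired in sequencing
87122220 + 0 properly paired (98.27% : N/A)
"""


def parse_flagstat(text):
    """Hypothetical helper: map each flagstat category to its QC-passed count."""
    counts = {}
    for line in text.strip().splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.+)$", line.strip())
        if match is None:
            continue  # skip anything that is not a "N + M label" line
        # Drop trailing "(97.92% : N/A)" or "(QC-passed reads + ...)" notes,
        # but keep qualifiers such as "(mapQ>=5)" that are part of the label.
        label = re.sub(r"\s*\((?:[\d.]+%|N/A|QC-passed).*\)$", "", match.group(3))
        counts[label] = int(match.group(1))
    return counts


original = parse_flagstat(ORIGINAL)
processed = parse_flagstat(PROCESSED)

extra_total = processed["in total"] - original["in total"]
extra_supplementary = processed["supplementary"] - original["supplementary"]
same_pairs = original["paired in sequencing"] == processed["paired in sequencing"]

print("extra records in total:   ", extra_total)
print("extra supplementary reads:", extra_supplementary)
print("same number of paired reads:", same_pairs)
```

If the two differences match and the "paired in sequencing" counts are equal, the extra records are supplementary alignment records (SAM flag 0x800) rather than extra sequenced reads.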

Please help! :( Thank you!
