Biostar Beta. Not for public use.
Question: How to tell whether a bam file is "analysis-ready" (as defined by GATK)?

Hi, I'm new to analyzing whole exome sequencing data, so my question may be quite naive. I have some bam files to analyze, and my ultimate goal is to run GATK's germline short variant workflow. As recommended by GATK, the first step should be to make sure the bam files are "analysis-ready".

So my 1st question is, how can I tell whether the bam files that I got are "analysis-ready"?

To be safe, I performed GATK's data pre-processing workflow on the bam files that I have. Then I used samtools flagstat to check their results:

The original bam file (file size: 14 GB):

88656828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
15067704 + 0 duplicates
86814510 + 0 mapped (97.92% : N/A)
88656828 + 0 paired in sequencing
44328414 + 0 read1
44328414 + 0 read2
85770856 + 0 properly paired (96.74% : N/A)
86046282 + 0 with itself and mate mapped
768228 + 0 singletons (0.87% : N/A)
131052 + 0 with mate mapped to a different chr
63913 + 0 with mate mapped to a different chr (mapQ>=5)

The GATK processed bam file (file size: 16 GB, which is 2 GB larger):

88861074 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
204246 + 0 supplementary
14042391 + 0 duplicates
88661994 + 0 mapped (99.78% : N/A)
88656828 + 0 paired in sequencing
44328414 + 0 read1
44328414 + 0 read2
87122220 + 0 properly paired (98.27% : N/A)
88327966 + 0 with itself and mate mapped
129782 + 0 singletons (0.15% : N/A)
294904 + 0 with mate mapped to a different chr
186027 + 0 with mate mapped to a different chr (mapQ>=5)

The two sets of results (which may also help answer my 1st question) are not exactly the same. So my 2nd question is, which bam file should I use for later processing?

Thank you very much for your help!

9 months ago by minimax

Never use file size as a QC criterion. If you are following GATK recommendations then stick with them.

Not sure why your total read number is slightly different.
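On the first question, one rough heuristic is to look at the BAM header. GATK's pre-processing workflow produces reads that are mapped, coordinate-sorted, duplicate-marked and base-quality recalibrated, and some of that history is usually recorded in the header. The sketch below assumes the header text has already been dumped with `samtools view -H`; the function name `summarize_header` and the example header are hypothetical, and the program IDs that a given tool version writes to `@PG` lines are not guaranteed, so a missing entry is a hint rather than proof.

```python
def summarize_header(header_text):
    """Collect sort order, read-group IDs and program IDs from SAM header text."""
    sort_order = None
    read_groups = []
    programs = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header fields after the record type look like TAG:VALUE.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@RG":
            read_groups.append(tags.get("ID"))
        elif fields[0] == "@PG":
            programs.append(tags.get("ID"))
    return {
        "sort_order": sort_order,
        "read_groups": read_groups,
        "programs": programs,
    }


# Hypothetical header, for illustration only.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA\n"
    "@PG\tID:bwa\tPN:bwa\n"
    "@PG\tID:MarkDuplicates\tPN:MarkDuplicates\n"
)

summary = summarize_header(example_header)
print(summary["sort_order"])   # sort order declared in @HD
print(summary["read_groups"])  # at least one read group is expected
print(summary["programs"])     # look for duplicate-marking and recalibration steps
```

A header with `SO:coordinate`, at least one `@RG` line, and `@PG` entries for duplicate marking and base recalibration suggests the file has already been through the pre-processing steps; anything missing is a reason to rerun them.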

9 months ago by genomax

The number of reads paired in sequencing is the same in both files (88656828). The processed bam has 204246 supplementary reads whereas the original has 0, and this number is exactly the difference in total reads (88861074 - 88656828 = 204246).
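That comparison can be checked mechanically rather than by eye. This is a minimal sketch, assuming the flagstat text is pasted in as strings; `parse_flagstat` is a hypothetical helper, not part of samtools, and only a few of the lines from the outputs above are included.

```python
import re


def parse_flagstat(text):
    """Map each flagstat category to its QC-passed read count."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not m:
            continue
        # Drop trailing percentage annotations such as "(97.92% : N/A)".
        label = re.sub(r"\s*\([^()]*:[^()]*\)$", "", m.group(3)).strip()
        if label.startswith("in total"):
            label = "in total"
        counts[label] = int(m.group(1))
    return counts


original = parse_flagstat("""
88656828 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 supplementary
88656828 + 0 paired in sequencing
""")

processed = parse_flagstat("""
88861074 + 0 in total (QC-passed reads + QC-failed reads)
204246 + 0 supplementary
88656828 + 0 paired in sequencing
""")

extra_total = processed["in total"] - original["in total"]
extra_supp = processed["supplementary"] - original["supplementary"]

print(extra_total)                # 204246
print(extra_total == extra_supp)  # True
print(processed["paired in sequencing"] == original["paired in sequencing"])  # True
```

So the extra records in the processed file are all supplementary alignments, and the underlying set of sequenced reads is unchanged.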

9 months ago by minimax

Please help! :( Thank you!

9 months ago by minimax
