The Meaning of samtools flagstat Output for Illumina's BAM File
13.1 years ago
Junfeng ▴ 330

Hi, I have BAM files from Illumina. I obtained the following output from running the samtools flagstat command on one BAM file:

1403478261 in total

100745504 QC failure

63676619 duplicates

1403478261 mapped (100.00%)

1403478261 paired in sequencing

704406339 read1

699071922 read2

1331910684 properly paired (94.90%)

1343780774 with itself and mate mapped

59697487 singletons (4.25%)

7717886 with mate mapped to a different chr

7574861 with mate mapped to a different chr (mapQ>=5)

Could anyone explain what "QC failure" and "properly paired" mean in a BAM file delivered by Illumina?
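To make the numbers above easier to reason about, here is a minimal Python sketch that parses this older single-count flagstat layout and recomputes the percentages. The names `parse_flagstat` and `percent` are hypothetical helpers written for this illustration, not part of samtools, and the parser assumes the exact line format shown in the question.

```python
import re

# The flagstat output quoted in the question (older single-count layout).
FLAGSTAT_TEXT = """\
1403478261 in total
100745504 QC failure
63676619 duplicates
1403478261 mapped (100.00%)
1403478261 paired in sequencing
704406339 read1
699071922 read2
1331910684 properly paired (94.90%)
1343780774 with itself and mate mapped
59697487 singletons (4.25%)
7717886 with mate mapped to a different chr
7574861 with mate mapped to a different chr (mapQ>=5)
"""

# Count, then label, then an optional trailing "(NN.NN%)" that is dropped.
LINE_RE = re.compile(r"^(\d+)\s+(.+?)(?:\s+\([\d.]+%\))?$")


def parse_flagstat(text):
    """Return a dict mapping each flagstat label to its read count."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def percent(part, whole):
    """Percentage of `part` relative to `whole`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    counts = parse_flagstat(FLAGSTAT_TEXT)
    total = counts["in total"]
    for label in ("QC failure", "duplicates", "properly paired", "singletons"):
        print(f"{label}: {percent(counts[label], total):.2f}% of all reads")
```

Two quick consistency checks fall out of the parsed counts: read1 plus read2 equals the total, and "with itself and mate mapped" plus "singletons" also equals the total.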

next-gen sequencing samtools illumina • 12k views
13.1 years ago
Nina ▴ 400

If you want to do some googling, the QC metric used by Illumina is called the "chastity filter". Chastity is a measure of the signal-to-noise ratio and is defined as "the ratio of the highest of the four (base type) intensities to the sum of highest two."

The chastity threshold is 0.6, and this threshold is applied to the first ~20 positions in the read, regardless of the read length. (I forget exactly how many positions are currently used, as the chastity filtering algorithm has changed over time.) At most one base is allowed to fail to meet the 0.6 ratio threshold; if more than one base is below the threshold, the read is marked as failing QC. (For paired-end reads, if either end of a pair fails to meet the threshold, both ends are marked as QC failed.)

The most common source of chastity failure is two (or more) adjacent clusters being so physically close together that their signals cannot be measured independently.
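The filter described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this answer, not Illumina's implementation: the function names are hypothetical, the intensities are made-up numbers, and `n_cycles=20` is a placeholder because the exact number of positions checked is not pinned down here.

```python
def chastity(intensities):
    """Highest of the four base intensities divided by the sum of the highest two."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_chastity_filter(cycles, threshold=0.6, n_cycles=20, max_failures=1):
    """Return True if at most `max_failures` of the first `n_cycles` positions
    have chastity below `threshold`.

    `n_cycles` is an assumed placeholder; the answer only says "~20".
    """
    failures = sum(1 for c in cycles[:n_cycles] if chastity(c) < threshold)
    return failures <= max_failures


if __name__ == "__main__":
    clean = (100.0, 10.0, 5.0, 5.0)   # one dominant signal: chastity ~0.91
    mixed = (50.0, 50.0, 0.0, 0.0)    # two equal signals: chastity 0.5
    print(passes_chastity_filter([clean] * 20))               # no failing positions
    print(passes_chastity_filter([mixed] + [clean] * 19))     # one failing position
    print(passes_chastity_filter([mixed] * 2 + [clean] * 18)) # two failing positions
```

The `mixed` tuple mimics the overlapping-cluster case: two signals of similar strength drive the ratio toward 0.5, which is below the 0.6 threshold.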

Got it. Thank you very much.

Any chance you have a reference for "the ratio of the highest of the four (base type) intensities to the sum of highest two"? I can't find it on Illumina's homepage.

13.1 years ago

QC Failure= too many Ns in your reads

properly paired = you used paired-end sequencing (http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn), and both the left and right reads have been mapped to the same region of the genome at a distance compatible with the expected mean size of the fragments.
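For context, flagstat derives its categories from the bitwise FLAG field of each alignment record. The sketch below decodes a few of those bits in Python; the bit values come from the SAM specification, while `FLAG_BITS` and `decode_flag` are hypothetical names used only for this illustration, and the table is deliberately incomplete.

```python
# Subset of SAM FLAG bits relevant to the flagstat lines discussed here.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x40: "read1",
    0x80: "read2",
    0x200: "qc_fail",
    0x400: "duplicate",
}


def decode_flag(flag):
    """Return the names of the listed bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


if __name__ == "__main__":
    # 99 = 0x1 + 0x2 + 0x20 + 0x40 (0x20, mate on reverse strand, is not listed above)
    print(decode_flag(99))
    # The same record with the QC-fail bit set
    print(decode_flag(99 | 0x200))
```

A read counted under "QC failure" has bit 0x200 set, and a read counted under "properly paired" has bit 0x2 set; the two are independent, so a read can be both.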

Hi Pierre, thanks for your useful answer. For QC failure, is there a threshold in the definition, such as 50 Ns in a read? By the way, could you tell me how to obtain the read length? Should I just use the samtools view command to have a look?

Typically, QC failure = did not pass the purity filter. Such reads may not have too many Ns.

For QC failure, see Nina's answer.

12.9 years ago
Ketil 4.1k

"Properly paired" comes from the flags set by whatever aligner you've used. For instance, if you use bwa's 'sampe' alignment, you can specify the maximum insert length with -a. For a pair to be "properly paired", both reads would need to be mapped to the same sequence within this distance.
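The rule stated above can be written as a small Python predicate. This is a simplified sketch of the criterion as described in this answer, not bwa's actual logic: `is_properly_paired` and its parameters are hypothetical, `max_insert` stands in for the value passed with -a, and real aligners may apply further conditions (for example on read orientation) that are not modelled here.

```python
def is_properly_paired(chrom1, pos1, chrom2, pos2, max_insert):
    """Simplified check: both mates mapped, same reference sequence,
    and mapping positions no further apart than `max_insert`.

    A chromosome of None means that mate is unmapped.
    """
    if chrom1 is None or chrom2 is None:
        return False
    if chrom1 != chrom2:
        return False
    return abs(pos2 - pos1) <= max_insert


if __name__ == "__main__":
    print(is_properly_paired("chr1", 1000, "chr1", 1300, max_insert=500))  # close together
    print(is_properly_paired("chr1", 1000, "chr1", 2600, max_insert=500))  # too far apart
    print(is_properly_paired("chr1", 1000, "chr7", 1300, max_insert=500))  # different chr
```

Under this simplified rule, a library whose true fragment size exceeds the limit would have most of its pairs fall outside the "properly paired" category even when both mates map well.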

Please elaborate! My comment was just from checking the documentation, other sources say that BWA will infer the stats. But I notice that my 1.5K(?) mate-pair library fails to have its pairs matched.

That -a option usually does not work the way you would expect...
