Tutorial:Sam File Format - Lesser Known Tips And Tricks
2
29
Entering edit mode
11.3 years ago
Ya ▴ 300

First the defintion of the Sequence Alignment/Map (SAM). It is aTAB-delimited. Apart from the header lines, which are started with the ‘@’ symbol, each alignment line consists of:

Column Fields Description

  1. QNAME Query template/pair NAME
  2. FLAG bitwise FLAG
  3. RNAME Reference sequence NAME
  4. POS 1-based leftmost POSition/coordinate of clipped sequence
  5. MAPQ MAPping Quality (Phred-scaled)
  6. CIGAR extended CIGAR string
  7. MRNM Mate Reference sequence NaMe (‘=’ if same as RNAME)
  8. MPOS 1-based Mate POSistion
  9. LEN inferred Template LENgth (insert size)
  10. SEQ query SEQuence on the same strand as the reference
  11. QUAL query QUALity (ASCII-33 gives the Phred base quality)
  12. OPT variable OPTional fields in the format TAG:VTYPE:VALUE
sam • 23k views
ADD COMMENT
2
Entering edit mode

Let's use this thread to add information on the SAM format that may not always be obvious or well documented.

ADD REPLY
18
Entering edit mode
11.3 years ago

Common challenges

  1. If the reverse strand flag is set in column 2 (FLAG) ( with value 4) then the sequence reported in column 10 (SEQ) will be the reverse complement!
  2. The 5' end of reads on the forward strand correspond to colum 4 (POS). To find the 5' end of the read on the reverse strand you will need to use column 4 (POS) and add to that the length that you parse from the CIGAR string that indicates the length of the actual alignment. Writing custom code to do this properly and efficiently is a non-trivial task. One convenient approach is to convert to BED format via the bamtobed command of BedTools.
  3. The value of the CIGAR string (column 5) and the value of the edit strings in the column 12 options (OPTS) may be different. This depends on the aligner. The alignment process often contains of two steps: a fast heuristics and an optimal alignment. The two values may correspond to each of these processes.
  4. Color space aligners will usually produce a letter space sequence in the 10 (SEQ) column. Reverting that to the original color space representation is also challenging task (I know of no tools that can do that).
  5. Learn more about the value stored in column 5, mapping quality (MAPQ) at C: C: C: A: Why there are a lot of MQ0 reads in some particular regions?
  6. The SAM format is 1 based (like the GFF format). The BAM format is 0 based (like the BED format). Usually when viewing and processing BAM files we produce a SAM format from a BAM and the conversion is automatic. But if you read a BAM file directly you will need to account for the coordinate system differences.

Mathematical Model

See the mathematical models used in Samtools: http://www.broadinstitute.org/gatk/media/docs/Samtools.pdf

ADD COMMENT
2
Entering edit mode
11.3 years ago
Ying W ★ 4.2k

This is a handy website that will explain what the sam flags mean (convert numbers into flags and vice versa)

An unofficial fork by the main developer has some changes that have not yet been integrated into samtools.

ADD COMMENT

Login before adding your answer.

Traffic: 2687 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6