Alright, so what you're actually asking for is background info about the SAM header.
Just to cover all the basics (you may be aware of them already, but bear with me), here's an image I often use when I teach, which shows a schema of a typical SAM/BAM file.
The header section and the alignment section are very different in terms of their content and their format as you can see.
- the header contains information about how the alignment was generated and stored
- every line belonging to the header section begins with @, followed by a "record type", such as SQ, followed by tag:value pairs where tag is a two-letter string (such as LN). Every record type has well-defined tags that belong to it, and every tag has a specific way in which its values are denoted. Take, for example, the record type SQ, which stands for "reference sequence dictionary" in SAM spec speak or "reference genome" in bioinfo terms.
- if you look up the SAM file specs, pages 3-5, you can see that for SQ the following tags are allowed: SN, LN, AH, AN, AS, DS, M5, SP, UR
The SQ entries for a hypothetical organism with 3 chromosomes of length 1000, 1500, and 3000 could be represented as follows in the header section (in an actual file, the fields are separated by tabs):
@SQ SN:chr1 LN:1000
@SQ SN:chr2 LN:1500
@SQ SN:chr3 LN:3000
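To make the structure of those lines concrete, here is a minimal Python sketch that splits a header line into its record type and its tag:value pairs, then collects the chromosome lengths from the SQ lines. The function name `parse_header_line` and the `"text"` key used for comment lines are my own choices for illustration, not part of any library, and the sketch assumes tab-separated fields as the SAM spec prescribes.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, {tag: value})."""
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # comment lines hold free text, not tag:value pairs
        return record_type, {"text": "\t".join(fields[1:])}
    tags = {}
    for field in fields[1:]:
        # split on the first colon only, since values may contain colons
        tag, _, value = field.partition(":")
        tags[tag] = value
    return record_type, tags


# the three-chromosome example from above, with explicit tabs
header = [
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:1500",
    "@SQ\tSN:chr3\tLN:3000",
]

chrom_lengths = {}
for line in header:
    record_type, tags = parse_header_line(line)
    if record_type == "SQ":
        chrom_lengths[tags["SN"]] = int(tags["LN"])

print(chrom_lengths)  # {'chr1': 1000, 'chr2': 1500, 'chr3': 3000}
```

This is roughly the lookup that downstream tools perform when they need the chromosome names and lengths of the reference.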
So, in summary:
- the header is theoretically optional, but very basic information, such as the lengths of the chromosomes of the reference genome, is often required by downstream tools
- EDIT following a comment by Genomax: if you decide to include a header with certain entries, such as SQ, some tags are required for a properly formatted SAM/BAM file (those are marked by asterisks in the SAM specs; for SQ, they are SN and LN)
- a CO (comment) line is handy for keeping track of the specific alignment command that was used to generate a BAM file. If you're merging multiple BAM files, you can keep multiple CO lines to document the different commands that were used, retain a single one if the same command was used for all the individual files, or do something else entirely. How much metadata you keep in the header is up to you.
If you're just starting out, you probably don't want to add your own custom-brewed entries to the header. I would recommend using the existing header that contains the info that is relevant and correct for all the BAM files you're merging.
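As a rough illustration of the points above, the following Python sketch checks that several headers describe the same reference sequences and then keeps one set of SQ lines plus every distinct CO line. The names `sequence_dictionary` and `merge_sq_and_co` are hypothetical helpers, not functions from samtools or any other tool, and the sketch deliberately ignores all other record types (HD, RG, PG), which a real merge also has to handle.

```python
def sequence_dictionary(header_lines):
    """Return the ordered list of (SN, LN) pairs from the @SQ lines."""
    pairs = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        # SN and LN are the tags the spec marks as required for @SQ
        if "SN" not in tags or "LN" not in tags:
            raise ValueError(f"@SQ line lacks SN or LN: {line!r}")
        pairs.append((tags["SN"], int(tags["LN"])))
    return pairs


def merge_sq_and_co(headers):
    """Keep one copy of the @SQ lines and every distinct @CO line."""
    reference = sequence_dictionary(headers[0])
    for header in headers[1:]:
        if sequence_dictionary(header) != reference:
            raise ValueError("reference sequences differ between files")
    merged = [line for line in headers[0] if line.startswith("@SQ")]
    seen = set()
    for header in headers:
        for line in header:
            # identical commands collapse into a single @CO line
            if line.startswith("@CO") and line not in seen:
                seen.add(line)
                merged.append(line)
    return merged


# hypothetical headers of two BAM files aligned with different commands
header_a = ["@SQ\tSN:chr1\tLN:1000", "@CO\talignment command A"]
header_b = ["@SQ\tSN:chr1\tLN:1000", "@CO\talignment command B"]

for line in merge_sq_and_co([header_a, header_b]):
    print(line)
```

If both files had been produced with the same command, only one CO line would survive; if their SQ lines disagreed, the sketch would refuse to merge, which is usually a sign that the files were aligned against different references.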
One more comment: I don't think you meant RQ; I assume you're referring to @RG. To find out more about the significance of that particular entry, you may find this Biostars post helpful.
And one last question: Why do you want to merge the files in the first place?