Samtools sorting by read names vs. chromosomal coordinates
6.8 years ago
ropolocan ▴ 810

Samtools can sort alignments either by read name or by chromosomal coordinate. Why would someone choose one over the other? What practical aspects of each sorting method matter for downstream applications (e.g. counting the number of hits against the reference genome)?

6.8 years ago

Many programs require bam files to be sorted by coordinate; this can save memory in specific operations such as variant calling and coverage calculation. Programs like IGV also need coordinate-sorted, indexed bam files so that, when you display a given genomic region, they can rapidly access and display the relevant reads.
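To illustrate why coordinate order helps with region lookups, here is a minimal Python sketch. It is not how samtools or IGV actually index a bam file; it only shows that once records are ordered by position, a binary search can jump straight to a region instead of scanning every read. The toy records and the `reads_in_region` helper are made up for this example, and chromosomes are ordered as plain strings here, whereas a real coordinate sort follows the reference order in the file header.

```python
from bisect import bisect_left, bisect_right

# Toy alignments as (chromosome, 1-based start position, read name).
# All values are invented for illustration.
alignments = [
    ("chr1", 500, "readC"),
    ("chr2", 100, "readA"),
    ("chr1", 120, "readB"),
    ("chr1", 300, "readA"),
    ("chr2", 900, "readB"),
    ("chr1", 310, "readC"),
]

# Coordinate sort: order by chromosome, then by start position.
coord_sorted = sorted(alignments, key=lambda rec: (rec[0], rec[1]))

# Parallel list of sort keys, used for binary search.
keys = [(chrom, pos) for chrom, pos, _ in coord_sorted]


def reads_in_region(chrom, start, end):
    """Return names of reads whose start position lies in [start, end] on chrom."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return [name for _, _, name in coord_sorted[lo:hi]]


print(reads_in_region("chr1", 200, 400))
```

The same lookup on unsorted records would need a full pass over the data for every region, which is the cost a genome browser avoids by requiring sorted, indexed input.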

Name-sorting is useful for operations that require reads to be paired. Sam files often break interleaved pairing order, and coordinate-sorted bam files always do. Starting from a coordinate-sorted bam, restoring the original fastq read order can take a lot of time and memory (and the original fastq typically cannot be fully restored). Name-sorting the file makes restoring pairing trivial, since the two reads of each pair end up adjacent.
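As a rough illustration of why name order makes pairing trivial, the Python sketch below sorts toy records by read name and then groups adjacent records. It is only a sketch under simplifying assumptions: the records are invented, names are compared as plain strings (the ordering samtools uses for name sorting may differ), and real files can also contain secondary or supplementary alignments that share a name. The `iter_pairs` helper is hypothetical.

```python
from itertools import groupby

# Toy paired-end records in coordinate order: (read name, chromosome, position).
# Mates of the same pair are scattered across the list.
coord_order = [
    ("pair2", "chr1", 100),
    ("pair1", "chr1", 150),
    ("pair3", "chr1", 200),
    ("pair1", "chr1", 400),
    ("pair3", "chr2", 50),
    ("pair2", "chr2", 700),
]

# Name sort: after this, records sharing a name sit next to each other.
name_sorted = sorted(coord_order, key=lambda rec: rec[0])


def iter_pairs(records):
    """Group adjacent records that share a read name."""
    for name, group in groupby(records, key=lambda rec: rec[0]):
        yield name, list(group)


for name, mates in iter_pairs(name_sorted):
    print(name, mates)
```

Running `iter_pairs` on `coord_order` instead yields six single-record groups, because no two mates are adjacent; pairing them would require holding unmatched reads in memory until their mates show up.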

So: some downstream programs require a sorted, indexed bam, and for those programs that is what you need to provide. But when programs can handle unsorted sam output, I suggest using a gzipped sam file with reads in the original order. This makes recovery or remapping of the original data much easier (aside from the inherent lossiness of the sam format, which will discard the original names) and generally makes pipelines faster than using bam files as an intermediate stage.
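For anyone who wants to consume such a gzipped sam file directly, here is a minimal Python sketch that streams records in file order using only the standard library. The `iter_sam_records` helper is hypothetical, it assumes a tab-delimited sam file compressed with gzip, and it reads only the first four mandatory columns (read name, flag, reference name, position).

```python
import gzip


def iter_sam_records(path):
    """Yield (read name, flag, reference name, position) from a gzipped sam file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header lines start with '@'; alignment lines never do.
            if line.startswith("@"):
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), fields[2], int(fields[3])
```

Because the file is read as a stream, records come out in exactly the order the mapper wrote them, so no sorting step is needed to keep the original read order.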

Note that as of samtools 1.4, the bam format is much faster and may be competitive with gzipped sam files, depending on the situation.


Thank you for the very informative answer, Brian. Your answer is clear and I am now able to understand why one would prefer one way of sorting over the other.
