SSPACE and the -k parameter
11.4 years ago
Ketil 4.1k

I'm trying to scaffold some assemblies with SSPACE 2.1. According to the manual:

The -k minimal links and -a maximum link ratio
---------------
Two parameters control scaffolding (-k and -a). The -k option specifies
the minimum number of links (read pairs) a valid contig pair must have to
be considered. The -a option specifies the maximum ratio between the
best two contig pairs for a given contig being extended.

In other words, a higher -k value means stricter scaffolding, which should lead to fewer joins and shorter scaffolds. But what I see is that as this parameter increases, I get better N50 and more misassemblies (measured using linkage group information). This seems surprising. Does anybody have any ideas why that might be?

scaffolding assembly • 3.9k views
11.3 years ago
Rayan Chikhi ★ 1.5k

The short answer is that increasing -k makes the contig graph (where nodes are contigs and edges are connections supported by paired reads) sparser. A dense contig graph (low -k) is likely to yield shorter but more accurate scaffolds, because ambiguities are preserved and the scaffolder declines to extend through them. A somewhat sparser contig graph has less branching and, provided -k is not so high that it masks most links, yields longer but less accurate scaffolds, because the scaffolder chooses less conservatively. The graph becomes too sparse once most true links between contigs are masked, at which point scaffolds get shorter again.

Consider a contig connected to two (or more) other contigs. The number of read pairs supporting each link may differ significantly. This can be due to uneven coverage, or to repeats in that contig with different copy counts. It doesn't indicate that the high-support link is the only correct one. Increasing -k may mask every link except the high-support one, and SSPACE will then treat that link as the only one present and therefore true, when in fact it may not be. Either both links are true (and the scaffolder shouldn't choose), or the high-support link is false and one of the others should have been chosen (but the scaffolder typically can't know that).
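To make the masking effect concrete, here is a toy sketch of a single extension decision. It is not SSPACE's code: the function name `choose_link`, the example link counts, and the exact rule (drop links below -k first, then accept the best link only if second-best/best is below -a) are assumptions made for illustration.

```python
def choose_link(links, min_links=5, max_ratio=0.7):
    """Toy model of one extension decision (hypothetical, not SSPACE itself).

    links: dict mapping neighbour contig name -> number of supporting read pairs.
    Returns the chosen neighbour, or None if nothing survives the -k filter
    or the two best candidates are too close to call.
    """
    # Assumed step 1: mask links supported by fewer than min_links read pairs (-k).
    kept = {name: n for name, n in links.items() if n >= min_links}
    if not kept:
        return None

    # Rank surviving links by support, strongest first.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        # Only one link left: it is accepted without any ambiguity check.
        return ranked[0][0]

    # Assumed step 2: compare the two best links (-a); refuse if they are too similar.
    best, second = ranked[0], ranked[1]
    ratio = second[1] / best[1]
    return best[0] if ratio < max_ratio else None


# A contig with two plausible neighbours (made-up counts).
links = {"contig_B": 40, "contig_C": 30}
for k in (5, 35, 50):
    print(k, choose_link(links, min_links=k))
```

With these made-up counts, k=5 keeps both links and the ratio check refuses to choose, k=35 masks contig_C so contig_B is joined unchallenged, and k=50 masks everything. That mirrors the three regimes described above: conservative, longer but riskier, and too sparse.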

The default -k value is quite low (5, I think). In many situations your data will have many more reads spanning contigs, so it is natural that you can afford to increase -k, but the phenomenon described above will be amplified. Of course, at the same time, some artificial links will be removed, and some correct scaffolds will be produced as a result.


I think this must be correct, and that the graph is pruned of edges with fewer than -k links before the difference in weight between links (the -a parameter) is considered. Generally, I think it is better to keep -k low (just enough to filter out noise from mismapped reads) and use -a for stringency.
