Question

Distance Between Read Sets


10.7 years ago

Lee Katz ★ 3.1k

Hi, given a few different Illumina raw read sets in FASTQ or any other format, how would you most efficiently determine the genetic distance between them?

I am looking for a method that can be extended as the number of genomes grows. In other words, I would like to calculate the distances between the first few genomes; with the next Illumina run, I would want to calculate distances within the new genomes and between the new and previous runs, but I would not want to recalculate the distances within the first set of genomes.

I thought that building assemblies and comparing them with MUMmer might be the most efficient approach over time, but I am wondering whether there is an assembly-free way to do it too.

distance • 1.6k views

updated 10.7 years ago by Eric Normandeau 11k • written 10.7 years ago by Lee Katz ★ 3.1k

score 2 · Answer 1 · 2013-08-15

A quick and dirty (and potentially computationally light) way to do this would be to compare the kmers present in the genomes.

Three pairwise metrics come to mind:

The proportion of kmers that are present in both genomes
The proportion of kmers that are present in only one of the two genomes
Some form of kmer depth difference metric (like the sum of the squared kmer depth differences)
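
The three metrics above could be sketched roughly as follows. This is only an illustration under several assumptions that are not in the answer itself: the function names are made up, the "proportions" use the union of distinct kmers from both read sets as the denominator, kmers containing N are skipped, and nothing is done about reverse complements, sequencing errors, or differences in sequencing depth between runs.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length substring across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: ignore kmers with ambiguous bases
                counts[kmer] += 1
    return counts


def shared_fraction(a, b):
    """Fraction of distinct kmers (out of the union) found in both read sets."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


def unique_fraction(a, b):
    """Fraction of distinct kmers (out of the union) found in only one read set."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() ^ b.keys()) / len(union)


def depth_squared_difference(a, b):
    """Sum of squared kmer depth differences; a missing kmer counts as depth 0."""
    return sum((a[kmer] - b[kmer]) ** 2 for kmer in a.keys() | b.keys())
```

With this choice of denominator the first two values always sum to 1, so in practice one of them plus the depth metric would carry all the information. For real data, a dedicated kmer counter would likely be far faster than pure Python, and raw depths would probably need normalizing before the third metric means much.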

Since the comparison is pairwise, the number of comparisons grows with the square of the number of genomes. With enough memory, though, you can count the kmers in all the genomes first and store the counts in a hash table, after which the comparisons should be fast.
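
To connect this with the question's requirement of not recomputing old distances, here is a minimal sketch of keeping the per-genome kmer tables and the already-computed distances in memory, so that a new run only triggers new-versus-old and new-versus-new comparisons. The class name `DistanceStore`, its methods, and the use of a Jaccard-style distance on kmer presence are all assumptions chosen for illustration, not something the answer specifies.

```python
def jaccard_distance(a, b):
    """1 minus the fraction of distinct kmers shared by two kmer sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


class DistanceStore:
    """Hypothetical cache of kmer sets and pairwise distances."""

    def __init__(self):
        self.kmers = {}  # sample name -> set of kmers
        self.dist = {}   # frozenset({name_a, name_b}) -> distance

    def add_batch(self, batch):
        """Add a dict of name -> kmer set; return how many distances were computed."""
        computed = 0
        for name, kmers in batch.items():
            # Compare the new sample against everything stored so far:
            # older runs plus samples from this batch that were already added.
            for other, other_kmers in self.kmers.items():
                self.dist[frozenset((name, other))] = jaccard_distance(kmers, other_kmers)
                computed += 1
            self.kmers[name] = kmers
        return computed

    def distance(self, a, b):
        """Look up a previously computed distance."""
        return self.dist[frozenset((a, b))]
```

Adding a first batch of three genomes computes three distances; adding two more later computes seven (six new-versus-old plus one new-versus-new) and leaves the original three untouched. The total work is still quadratic in the number of genomes, but each pair is only ever compared once. For large collections the kmer sets would probably need to be written to disk or subsampled rather than held in memory.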