Distance Between Read Sets
10.7 years ago
Lee Katz ★ 3.1k

Hi, given a few different Illumina raw read sets in FASTQ or some other format:

How would you most efficiently determine the genetic distance between them?

I am looking for a method that can be extended as the number of genomes grows. In other words, I would like to calculate the distances between the first few genomes; then, with the next Illumina run, I would want to calculate distances among the new genomes and between the new genomes and those from earlier runs, but I would not want to recalculate the distances within the first set of genomes.

I thought that making assemblies and comparing them with MUMmer might be the most efficient approach over time, but I am wondering whether there is an assembly-free way to do it too.

distance • 1.6k views
10.7 years ago

A quick and dirty (and potentially computationally light) way to do this would be to compare the kmers present in the genomes.

Three pairwise metrics come to mind:

  1. The proportion of kmers that are present in both genomes
  2. The proportion of kmers that are present in only one of the two genomes
  3. Some form of kmer depth difference metric (like the sum of the squared kmer depth differences)
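
To make the three metrics concrete, here is a minimal Python sketch. The function names `count_kmers` and `kmer_metrics` are made up for illustration, and a few things are assumptions on my part rather than part of the suggestion above: the proportions are taken over the union of kmers seen in either read set, reverse complements are not merged, and nothing is done about sequencing errors or differences in coverage.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length substring across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_metrics(a, b):
    """Return (shared fraction, one-sided fraction, sum of squared depth differences).

    `a` and `b` are Counters mapping kmer -> depth. Both fractions use the
    union of kmers as the denominator (an assumption, see the text above).
    """
    shared = a.keys() & b.keys()
    union = a.keys() | b.keys()
    if not union:
        return 0.0, 0.0, 0
    shared_frac = len(shared) / len(union)
    one_sided_frac = 1.0 - shared_frac
    # A Counter returns 0 for a kmer it has never seen, so absent kmers
    # contribute their full depth in the other sample.
    depth_ssd = sum((a[kmer] - b[kmer]) ** 2 for kmer in union)
    return shared_frac, one_sided_frac, depth_ssd


if __name__ == "__main__":
    sample_a = count_kmers(["ACGTAC"], k=3)
    sample_b = count_kmers(["ACGTTT"], k=3)
    print(kmer_metrics(sample_a, sample_b))
```

For real read sets you would probably count kmers with a dedicated tool and drop low-depth kmers first, since most of those come from sequencing errors.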

Since this is done in a pairwise fashion, the number of comparisons grows as the square of the number of genomes. With enough memory, though, you can count the kmers in all the genomes first and store them in a hash table, after which the comparisons should be fast.
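
If the per-genome kmer sets are kept around, a new run only needs the new-vs-new and new-vs-old comparisons, which is what the question asks for. Below is a rough Python sketch of that bookkeeping. The names `jaccard_distance` and `add_run` are hypothetical, the distance used (1 minus the shared proportion over the union) is just one possible choice, and sample names are assumed to be unique across runs.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of kmers (0.0 if both are empty)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def add_run(kmer_sets, distances, new_sets):
    """Add the samples in `new_sets`, computing only the missing distances.

    kmer_sets: dict name -> set of kmers for samples already processed
    distances: dict frozenset({name1, name2}) -> distance, updated in place
    new_sets:  dict name -> set of kmers for the new run
    Returns the number of pairwise comparisons actually performed.
    """
    old_names = list(kmer_sets)
    new_names = list(new_sets)
    computed = 0
    # New samples against every previously stored sample.
    for new in new_names:
        for old in old_names:
            pair = frozenset((new, old))
            distances[pair] = jaccard_distance(new_sets[new], kmer_sets[old])
            computed += 1
    # New samples against each other.
    for x, y in combinations(new_names, 2):
        distances[frozenset((x, y))] = jaccard_distance(new_sets[x], new_sets[y])
        computed += 1
    kmer_sets.update(new_sets)
    return computed


if __name__ == "__main__":
    kmer_sets, distances = {}, {}
    add_run(kmer_sets, distances, {"g1": {"AAA", "AAC"}, "g2": {"AAA", "AAG"}})
    # Only the two comparisons involving g3 are done here; g1 vs g2 is reused.
    print(add_run(kmer_sets, distances, {"g3": {"AAA"}}))
```

The catch is memory or disk: the kmer sets for all earlier genomes have to stay available, so in practice you would likely store them on disk per sample rather than in one in-memory dict.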


Thanks! These are really great ideas. I think I can do this in Jellyfish...
