How Does One Describe All Differences Between Two Complete Chromosomes?
12.9 years ago

The scenario. Assume you have an accurate and staggeringly cheap sequencing technology that allows you to sequence and assemble human chromosomes to (near) completion - say under 100 large contigs. Now say you have the distinct pleasure of doing this for multiple individuals.

Let's permit another leap and assume that you have a really fantastic aligner that allows you to precisely align (pairwise) the chromosomes from two or more samples in mere minutes.

The problem. Fantastic. Now, what "grammars" exist for describing the differences between any two chromosomes from two different individuals (e.g., chr1 from Jim-Bob and Mary-Sue)? The grammar must account for SNPs, INDELs, and chromosomal rearrangements (e.g., inversions, duplications, deletions, insertions, translocations). The closest thing I know of is the CIGAR string, but it doesn't allow for changes in strand, duplications, etc.

Surely one must exist, but I can't find it. Any suggestions of literature to read?
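
To make the CIGAR limitation concrete, here is a minimal Python sketch, assuming the standard SAM operator set (M, I, D, N, S, H, P, =, X). The function names and the example string are made up for illustration. Every operator only advances along the query, the reference, or both, in one direction, so there is nowhere to encode an inversion, a duplication, or a translocation.

```python
import re

# SAM CIGAR operators: (consumes query bases, consumes reference bases).
CIGAR_OPS = {
    "M": (True, True),    # aligned column (match or mismatch)
    "I": (True, False),   # insertion relative to the reference
    "D": (False, True),   # deletion relative to the reference
    "N": (False, True),   # skipped reference region
    "S": (True, False),   # soft clip
    "H": (False, False),  # hard clip
    "P": (False, False),  # padding
    "=": (True, True),    # sequence match
    "X": (True, True),    # sequence mismatch
}

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def consumed_lengths(cigar):
    """Return (query bases, reference bases) spanned by the alignment."""
    query_len = ref_len = 0
    for length, op in parse_cigar(cigar):
        uses_query, uses_ref = CIGAR_OPS[op]
        if uses_query:
            query_len += length
        if uses_ref:
            ref_len += length
    return query_len, ref_len

# Hypothetical alignment: one SNP, a 3 bp insertion, and a 2 bp deletion.
print(consumed_lengths("50=1X49=3I100=2D20="))
```

Strand is stored outside the CIGAR (in the SAM FLAG field) and applies to the whole record, which is why a rearranged chromosome ends up split across several records rather than described by one string.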

comparative structural • 2.7k views

Nice thought experiment, but if such a thing existed, how would it be used? That is, what would its advantage be over a GFF or assembly-style file based on a common reference chromosome?


The assumption is that there will soon be many "reference" genomes and that a newly-sequenced genome will ultimately be higher quality than the reference.


Does it matter how good the 'reference' is? If everything is compared against it, surely the worst that will happen is that many samples share the same 'differences' from the reference.


Assume you have 100 completely assembled genomes in addition to the reference. Your lab is interested in gene X. How would you precisely compare each allele of gene X among every genome, while accounting for duplications, inversions, etc., of the gene and its up/downstream sequence? Multiple alignments would keep things "registered", but they lack a robust framework of tools for doing the comparisons. Or perhaps I am missing something obvious that already exists (a common occurrence, hence the question).


Agreed: copy number variants are problematic, but +1 for the cortex assembler answer below

12.9 years ago
lh3 33k

For a pair of genomes, the closest that exists is UCSC's chain format (you use that when you do liftOver). Nonetheless, it does not describe mismatches. You can imagine a simple extension to put mismatches in the format.
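
As a rough illustration of what the chain format does and does not carry, here is a minimal Python sketch that reads one chain record into ungapped blocks. It assumes the documented layout: a header line (`chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`) followed by `size dt dq` lines and a final `size` line. The record itself and the name `parse_chain` are made up for the example.

```python
HEADER_FIELDS = ("score", "t_name", "t_size", "t_strand", "t_start", "t_end",
                 "q_name", "q_size", "q_strand", "q_start", "q_end", "chain_id")

def parse_chain(text):
    """Parse a single chain record into (header dict, list of blocks).

    Each block is (t_start, q_start, size): an ungapped aligned run.
    Query coordinates are kept as written in the file, i.e. relative to
    the reverse-complemented query when q_strand is '-'.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    first = lines[0]
    if len(first) != 13 or first[0] != "chain":
        raise ValueError("expected a 13-field 'chain' header line")
    header = dict(zip(HEADER_FIELDS, first[1:]))
    t_pos, q_pos = int(header["t_start"]), int(header["q_start"])
    blocks = []
    for fields in lines[1:]:
        size = int(fields[0])
        blocks.append((t_pos, q_pos, size))
        t_pos += size
        q_pos += size
        if len(fields) == 3:  # gap sizes before the next block: dt, dq
            t_pos += int(fields[1])
            q_pos += int(fields[2])
    if t_pos != int(header["t_end"]) or q_pos != int(header["q_end"]):
        raise ValueError("block sizes do not add up to the header end coordinates")
    return header, blocks

# Hypothetical record: three ungapped blocks on the query's minus strand.
record = """chain 1000 chr1 1000 + 100 230 chr1_sampleB 1200 - 50 175 1
40 10 0
30 0 5
50
"""
header, blocks = parse_chain(record)
print(header["q_strand"], blocks)
```

The blocks say where sequence lines up and on which strand, but nothing about which bases differ inside a block; that is the part an extension would have to add.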

For large-scale relationships, AGP is the standard.

For multiple genomes, I think the representation is going to be a bidirectional graph, as described in Gene Myers' string graph paper, the Velvet paper, Mike Brudno's JCB paper, and Jared Simpson's string graph paper, among a few others. The Cortex assembler comes closest to following this line.
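
A toy Python sketch of the graph idea, under heavy simplifying assumptions: segments are non-overlapping strings (real string and de Bruijn graphs have overlapping nodes), and each sample is just a path of (segment, orientation) steps. The segment names and sequences are invented. The point is only that inversions and duplications become ordinary paths once nodes can be traversed in either direction.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical segments shared by all samples.
SEGMENTS = {"s1": "ACGT", "s2": "GGA", "s3": "TTC"}

def spell(path):
    """Spell out the sequence of a path of (segment, orientation) steps."""
    return "".join(
        SEGMENTS[name] if orient == "+" else revcomp(SEGMENTS[name])
        for name, orient in path
    )

# Each sample is a walk through the same segments.
paths = {
    "reference": [("s1", "+"), ("s2", "+"), ("s3", "+")],
    "inversion": [("s1", "+"), ("s2", "-"), ("s3", "+")],
    "duplication": [("s1", "+"), ("s2", "+"), ("s2", "+"), ("s3", "+")],
}

for sample, path in paths.items():
    print(sample, spell(path))
```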


Thanks for the great references, Heng.

12.9 years ago

Your question reminds me of Michael Barton's current project, Scaffolder (http://next.gs/man/scaffolder-format), and of his question "What Improvements Would You Recommend For This Genome Scaffolding Software?"

One could imagine using this software to build Jim's DNA from Mary's DNA.


Thanks Pierre, but it seems to me that Scaffolder is meant to describe ambiguities in the known structure of a single chromosome, not among multiple chromosomes. At least that is the interpretation I made from reading the "Getting Started" guide. Do you know otherwise?
