Question

How Does One Describe All Differences Between Two Complete Chromosomes?

11

12.9 years ago

Aaronquinlan 12k

The scenario. Assume you have an accurate and staggeringly cheap sequencing technology that allows you to sequence and assemble human chromosomes to (near) completion - say under 100 large contigs. Now say you have the distinct pleasure of doing this for multiple individuals.

Let's permit another leap and assume that you have a really fantastic aligner that allows you to precisely align (pairwise) the chromosomes from two or more samples in mere minutes.

The problem. Fantastic. Now, what "grammars" exist for describing the differences between any two chromosomes from two different individuals (e.g., chr1 from Jim-Bob and Mary-Sue)? The grammars must account for SNPs, INDELs, and chromosomal rearrangements (e.g., inversions, duplications, deletions, insertions, translocations). The closest thing I know of is the CIGAR string, but it doesn't allow for changes in strand, duplications, etc.
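To make the CIGAR limitation concrete, here is a minimal sketch in Python. It assumes the operation codes from the SAM specification; the function name `cigar_spans` and the example string are made up for illustration. Every operation only moves forward along the reference, the query, or both, so there is nothing in the alphabet that could say "switch strand here" or "this segment occurs twice".

```python
import re

# One CIGAR operation is a run length followed by an op code (SAM spec codes).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Ops that advance along the reference / the query, per the SAM specification.
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def cigar_spans(cigar):
    """Return (reference_bases, query_bases) covered by a CIGAR string.

    Hypothetical helper for illustration only.
    """
    ref_len = query_len = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in CONSUMES_REF:
            ref_len += int(length)
        if op in CONSUMES_QUERY:
            query_len += int(length)
    return ref_len, query_len


# 50 aligned, 2 inserted, 30 aligned, 5 deleted, 20 aligned.
print(cigar_spans("50M2I30M5D20M"))
```

The walk is strictly colinear: both counters only ever increase, which is why an inversion or a duplication needs a richer description than a single CIGAR string.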

Surely one must exist, but I can't find it. Any suggestions of literature to read?

comparative structural • 2.7k views

updated 10.7 years ago by Biostar 20 • written 12.9 years ago by Aaronquinlan 12k

0

Nice thought experiment, but if such a thing existed, how would it be used? That is, what would its advantage be over a GFF or assembly-style file based on a common reference chromosome?

12.9 years ago by Alastair Kerr 5.3k

0

The assumption is that there will soon be many "reference" genomes and that a newly-sequenced genome will ultimately be higher quality than the reference.

12.9 years ago by Aaronquinlan 12k

0

Does it matter how good the 'reference' is? If it is compared against everything, surely the worst that will happen is that many samples will share the same 'differences' relative to the reference.

12.9 years ago by Alastair Kerr 5.3k

0

Assume you have 100 completely assembled genomes in addition to the reference. Your lab is interested in gene X. How would you precisely compare each allele of gene X among every genome, while accounting for duplications, inversions, etc., of the gene and its up/downstream sequence? A multiple alignment would keep things "registered", but it lacks a robust framework for the tools needed to do the comparisons. Or perhaps I am missing something obvious that already exists (a common occurrence, hence the question).

12.9 years ago by Aaronquinlan 12k

0

Agreed: copy number variants are problematic, but +1 for the Cortex assembler answer below.

12.9 years ago by Alastair Kerr 5.3k

score 6 · Answer 1 · 2011-06-07

6

12.9 years ago

lh3 33k

For a pair of genomes, the closest thing that exists is UCSC's chain format (the one you use when you run liftOver). However, it does not describe mismatches. You can imagine a simple extension that adds mismatches to the format.
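As a rough illustration of what the chain format does and does not record, here is a minimal Python sketch that expands one chain record into its ungapped blocks. The record itself (names `chrA` and `chrB`, score, sizes) is invented, and `parse_chain` is a hypothetical helper, not part of any UCSC tool. The layout follows the published chain specification: a header line, then `size dt dq` lines, with a final line holding only `size`.

```python
def parse_chain(text):
    """Parse one UCSC chain record into a header dict and ungapped blocks.

    Each block is (target_start, query_start, size). Hypothetical helper.
    """
    lines = [ln.split() for ln in text.strip().splitlines()]
    h = lines[0]  # h[0] is the literal word "chain"
    header = {
        "score": int(h[1]),
        "t_name": h[2], "t_size": int(h[3]), "t_strand": h[4],
        "t_start": int(h[5]), "t_end": int(h[6]),
        "q_name": h[7], "q_size": int(h[8]), "q_strand": h[9],
        "q_start": int(h[10]), "q_end": int(h[11]),
        "id": h[12],
    }
    blocks = []
    t_pos, q_pos = header["t_start"], header["q_start"]
    for fields in lines[1:]:
        size = int(fields[0])
        blocks.append((t_pos, q_pos, size))
        t_pos += size
        q_pos += size
        if len(fields) == 3:  # every line but the last carries the gap sizes
            t_pos += int(fields[1])  # dt: unaligned bases in the target
            q_pos += int(fields[2])  # dq: unaligned bases in the query
    return header, blocks


# Invented example record: three ungapped blocks separated by two gaps.
EXAMPLE = """\
chain 4900 chrA 1000 + 100 250 chrB 900 - 50 200 1
50 10 0
40 0 10
50
"""

header, blocks = parse_chain(EXAMPLE)
print(header["q_strand"], blocks)
```

Note that a block only says "these `size` bases line up without gaps"; nothing inside it says which bases differ, which is the missing piece mentioned above. When the query strand is `-`, the query coordinates refer to the reverse-complemented query sequence.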

For large-scale relationships, AGP is the standard.
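For readers who have not met AGP, the sketch below shows roughly what it captures: an ordered list of components (with orientation) and gaps that tile a larger object. The three lines are invented, and `parse_agp_line` is a hypothetical helper. The column layout is assumed from the NCBI AGP specification, where component types `N` and `U` mark gap lines and coordinates are 1-based and inclusive.

```python
def parse_agp_line(line):
    """Parse one tab-delimited AGP line into a dict. Hypothetical helper."""
    f = line.rstrip("\n").split("\t")
    rec = {
        "object": f[0],
        "object_beg": int(f[1]),
        "object_end": int(f[2]),
        "part_number": int(f[3]),
        "component_type": f[4],
    }
    if f[4] in ("N", "U"):
        # Gap line: length, type, and whether the flanking parts are linked.
        rec.update(gap_length=int(f[5]), gap_type=f[6], linkage=f[7])
    else:
        # Sequence line: which piece of which component, and its orientation.
        rec.update(component_id=f[5], component_beg=int(f[6]),
                   component_end=int(f[7]), orientation=f[8])
    return rec


# Invented example: two contigs joined by a 100 bp gap, the second reversed.
EXAMPLE_AGP = [
    "chr1\t1\t5000\t1\tW\tcontigA\t1\t5000\t+",
    "chr1\t5001\t5100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chr1\t5101\t9100\t3\tW\tcontigB\t1\t4000\t-",
]

records = [parse_agp_line(line) for line in EXAMPLE_AGP]
for rec in records:
    print(rec)
```

The orientation column is what lets AGP express a piece placed in reverse, but the format works at the level of whole components, not individual bases.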

For multiple genomes, I think that is going to be a bidirectional graph, as described in Gene Myers' string graph paper, the Velvet paper, Mike Brudno's JCB paper and Jared Simpson's string graph paper, among a few others. The Cortex assembler comes closest to following this line.
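A toy sketch of the graph idea, in Python, may help. It is not the data structure used by any of the papers above or by Cortex; the segment names, sequences, and sample paths are all invented. Each genome is a walk over shared segments, and each step carries an orientation, so an inversion is a segment visited on the `-` strand and a duplication is a segment visited twice.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# Invented shared segments; real graphs would derive these from alignments.
segments = {"s1": "ACGT", "s2": "GGA", "s3": "TTC"}

# Each genome is a path of (segment, orientation) steps.
paths = {
    "jim_bob": [("s1", "+"), ("s2", "+"), ("s3", "+")],
    "mary_sue": [("s1", "+"), ("s2", "-"), ("s3", "+")],            # s2 inverted
    "third": [("s1", "+"), ("s2", "+"), ("s2", "+"), ("s3", "+")],  # s2 duplicated
}


def spell(path):
    """Reconstruct the sequence of a path through the segments."""
    return "".join(
        segments[name] if strand == "+" else reverse_complement(segments[name])
        for name, strand in path
    )


for sample, path in paths.items():
    print(sample, spell(path))
```

Comparing two genomes then reduces to comparing their paths: the steps where `jim_bob` and `mary_sue` differ are exactly the rearranged segments.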

12.9 years ago by lh3 33k

0

Thanks for the great references, Heng.

12.9 years ago by Aaronquinlan 12k

Ram · Answer 2 · 2011-06-07

0

12.9 years ago

Pierre Lindenbaum 161k

Your question reminds me of Michael Barton's current project, Scaffolder (http://next.gs/man/scaffolder-format), and of the question "What Improvements Would You Recommend For This Genome Scaffolding Software?"

One could imagine using this software to build Jim's DNA from Mary's DNA.

updated 4.6 years ago by Ram 43k • written 12.9 years ago by Pierre Lindenbaum 161k

0

Thanks, Pierre, but it seems to me that Scaffolder is meant to describe ambiguities in the known structure of a single chromosome, not differences among multiple chromosomes. At least that is how I read the "Getting Started" guide. Do you know otherwise?

12.9 years ago by Aaronquinlan 12k