Question

Mauve - Trouble understanding the .islands file


7.7 years ago

Iarwain ▴ 10

Hi everybody,

For my project I'm comparing the genome of an outbreak strain with strains from its own species. My goal is to identify possible unique regions in the outbreak strain. I use Mauve to compare the genomes.

According to the official Mauve website, the .islands file (one of Mauve's output files) seems to be the most interesting. It lists regions that are present in one or more genomes but absent from the others. However, I'm still having trouble fully understanding the file. The makers of Mauve, the Darling lab, explain the file here: link (scroll down to the "The Islands file" paragraph).

My question is this: how can one derive unique regions from a file like this? It describes where two genomes align, but that means the region is, by definition, not unique, right? Am I missing something, or simply misunderstanding the file?

mauve output • 2.1k views



I would read this article and look at how the term 'unique' is used there.

progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement

http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0011147

See, for example, this part of the article and the section that follows it:

"To minimize compute time and focus anchoring coverage on single-copy regions, our method only extends seeds that are unique in two or more genomes. By default, we use seed patterns with weight equal to [...]. This formula is also applied to determine the appropriate seed weight during recursive anchoring (Figure 2 step 5, described later), with the restriction that [...] in all cases. The resulting local multiple alignments are ungapped and always align a contiguous subsequence of two or more genomes in [...]. Any given local multiple alignment can be described formally by its length and vector of integers: [...], where [...] is a signed left-end coordinate of the LMA in [...], or 0. When [...] takes on a value of 0, the genome is absent from all of [...].

The LMAs found by our procedure are ungapped alignments of unique subsequences and thus are similar to multi-MUMs, but may contain mismatches according to the palindromic seed patterns. As with multi-MUMs, any portion of a unique LMA may be non-unique and no LMA may be completely contained within the boundaries of another LMA. We refer to the set of local multiple alignments generated in this step as [...]. An example is given in Figure 2 step 1."
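To make "unique in two or more genomes" concrete, here is a toy sketch in Python. It is not Mauve's implementation: it uses plain contiguous k-mers instead of the palindromic spaced seed patterns the paper describes, it ignores reverse complements, and it assumes that "unique" means "occurs exactly once within a genome". The function names and the example sequences are made up for illustration.

```python
from collections import Counter


def unique_kmers(seq, k):
    """Return the set of k-mers that occur exactly once in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n == 1}


def candidate_seeds(genomes, k):
    """Return k-mers that are unique within at least two genomes.

    genomes maps a genome name to its sequence string. The result maps
    each qualifying k-mer to the sorted names of the genomes in which
    it occurs exactly once.
    """
    hits = {}
    for name, seq in genomes.items():
        for kmer in unique_kmers(seq, k):
            hits.setdefault(kmer, []).append(name)
    return {kmer: sorted(names) for kmer, names in hits.items() if len(names) >= 2}


# Hypothetical toy sequences, far shorter than real genomes.
genomes = {
    "outbreak": "ACGTTGCA",
    "ref1": "ACGTAAAA",
    "ref2": "GGGGGGGG",
}
seeds = candidate_seeds(genomes, 4)
print(seeds)  # {'ACGT': ['outbreak', 'ref1']}
```

Here "unique" describes how often a seed occurs inside each genome (single-copy), not whether the sequence is private to one strain. A seed can be "unique" in this sense and still be shared by several genomes.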

Next section:

"In summary, this scoring scheme assigns high scores to well-conserved regions that are unique in each genome and does not consider gap penalties."
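Coming back to the original question: if the island entries are pairwise (a stretch of one genome that has no counterpart in one other genome), then a region is specific to the outbreak strain only if it shows up as an island against every other strain. Below is a minimal sketch of that intersection step. It assumes you have already extracted, for each comparison, the island coordinates on the outbreak genome as inclusive (start, end) pairs that do not overlap each other; it does not parse the .islands file itself, so check the Mauve documentation for the actual column layout. All names and coordinates are hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping, inclusive (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def unique_to_target(islands_by_comparison):
    """Return intervals reported as islands against every other genome.

    islands_by_comparison maps the name of another genome to a list of
    (start, end) intervals, in target-genome coordinates, that have no
    counterpart in that other genome.
    """
    result = None
    for intervals in islands_by_comparison.values():
        intervals = sorted(intervals)
        result = intervals if result is None else intersect(result, intervals)
    return result or []


# Hypothetical coordinates on the outbreak genome.
islands = {
    "ref1": [(100, 500), (900, 1200)],
    "ref2": [(300, 700), (1000, 1100)],
}
print(unique_to_target(islands))  # [(300, 500), (1000, 1100)]
```

Regions that survive the intersection are only candidates: it is still worth checking them, for example with a BLAST search against the other strains, because alignment gaps can also come from assembly problems or repeats.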

ADD REPLY • link 7.7 years ago by natasha.sernova ★ 4.0k