Question

Transferring Gene Annotations from Assembly with Large Duplications

0

Entering edit mode

3.5 years ago

Charles Warden 8.2k

I would like to transfer some gene annotations between a previous and updated assembly (for 1 “chromosome,” derived from what was originally a circular BAC). There will be a mix of large and small changes in the underlying sequence.

I think this is difficult because there are exact and non-exact duplications.

For example, let’s say the annotations on the previous version of the sequence look something like this:

GeneA1-1
GeneB1-1
GeneB2-1
GeneA2-1
GeneB2-2
GeneC
GeneA1-2
GeneB1-2
GeneB2-3
GeneA2-2
GeneB2-4

In that example, if at all possible, I would like to keep track of GeneB2-2 being between GeneA2-1 and GeneC (and GeneB2-3 being between GeneB1-2 and GeneA2-2, etc.). The “large changes” include deletions, but not inversions (that I am aware of, at least not where I need to transfer annotations). So, you could imagine that possible changes might include a SNP in GeneA2-2, a deletion of GeneB2-3, etc..

My guess is that using Exonerate can help, but I am wondering if the is some parameter combination that can work better in this situation.

For example, I am already using --bestn 1 to report 1 hit per query (unless there are ties, which there are for the duplicated genes that are 100% matches), as part of the following command:

/opt/exonerate-2.2.0-x86_64/bin/exonerate --bestn 1 --showtargetgff true --model coding2genome --showalignment no $QUERY $TARGET > $RESULT

I have currently condensed the query file (for example, collapse GeneA1-1 and GeneA1-2 into 1 query sequence called GeneA1). However, genes that are not 100% identical can have the same overlapping Exonerate hits (for example, GeneB1-1 and GeneB2-1 may not be 100% identical, but they may be getting each other’s Exonerate hits). I think this might relate to Exonerate defining a different gene structure than used in the original annotation, and/or genes that were 99% identical in the last assembly became 100% identical in the revised assembly.

Also, if I run Exonerate on the exact same sequence that has the current gene annotations, then they are not all identical. For example, a gene with 4 exons in the current annotations might have 3 exons in the Exonerate hit in roughly the same region (again, on the exact same total genomic sequence, as a positive control).

To describe that previous point in another way, I used gffcompare for the output of Exonerate with the FASTA file for the reduced set of unique query sequences on the original positive control sequence and the original reference annotations. That reports a loss of 22.5% of exons (and addition of 7.5% of “novel” exons). So, if at all possible, I would like to see if this could be improved.

It might be hard to see, but here is a view of the tracks in IGV:

IGV view

The left hand side has a different class of genes that I am not predicting with Exonerate (and out of the 4 clusters will be deleted in the revised version of the sequence, as an example of a "large" deletion). However, there is at least 1 Exonerate hit that looks too big (about 3/4 of the way through the sequence, on the right side of the image) and I hope that gives some sense of how similar the current annotations are to the Exonerate hits.

I can also indirectly run Exonerate through MAKER. That provides some extra parameters that can be changed. However, at least currently, I think the GffCompare metrics are relatively more similar to the reference with Exonerate being run as described above.

I also figured out how to create a custom .chain file for liftOver (and use CrossMap to convert a GFF/GTF file with annotations), which I mention in this other post. However, I am losing 20-25% of the blocks from the original annotation, so I think I would need to change something. Plus, I think Exonerate is helping in terms of keeping track of the components of the gene annotation (all of the queries have at least 1 hit, but there are duplicates and the hits are not always identical to the original annotation).

Does anybody have any suggestions of what might help (either for Exonerate or something else)?

Exonerate MAKER liftOver assembly annotation • 1.8k views

ADD COMMENT • link 3.5 years ago by Charles Warden 8.2k

0

Entering edit mode

I am not sure how much it helps, but there is some additional background about this task here:

https://github.com/marbl/canu/issues/1841

ADD REPLY • link 3.5 years ago by Charles Warden 8.2k

0

Entering edit mode

While it didn't help for this specific project, there is also some additional discussion about the parameters to create a .chain file for liftOver / CrossMap here:

https://groups.google.com/a/soe.ucsc.edu/g/genome-mirror/c/zx665aZPEXU

ADD REPLY • link 3.5 years ago by Charles Warden 8.2k

0

Entering edit mode

3.5 years ago

Charles Warden 8.2k

It is not a complete answer, but adding --maxintron 2000 helped clean up that outlier gene. In the current test example, adding --percent 97 --maxintron 2000 doesn't seem to remove any additional hits, but the first loss of some hits for this example occurs with a changing the threshold from 97% to 98%. So, perhaps that could be good to keep in mind.

The GffCompare statistics were all a little better with the 97% threshold than the 98% threshold. With --percent 97 --maxintron 2000, the statistics are a little better than for the question (18.0% missed exons and 7.1% novel exons, for example).

It might be hard to see, but this is a visual way to look at some parameter changes:

Current Annotations and Exonerate Hits with Varying Parameters

Again, the annotations on the left side are not supposed to be annotated with Exonerate (and one of the 4 clusters will be deleted in the revised sequence, as one of the "large" deletions). However, this is a positive control test with the existing annotations.

I also tested --subopt FALSE and --refine region, but I don't think that helped in this situation

If I can provide additional feedback, then I will do so.

ADD COMMENT • link 3.5 years ago by Charles Warden 8.2k

score 1 · Accepted Answer · 2020-11-13

1

Entering edit mode

3.5 years ago

Charles Warden 8.2k

I was a bit surprised, but I think the protein query with protein2genome worked better than the coding query with coding2genome (even though different CDS sequences can produce the same protein).

Similar to my other answer, I used the --percent 97 --maxintron 2000 parameters. GffCompare statistics indicate some improvement, such as 9.4% missed exons and 0.4% novel exons:

protein2genome comparison

Again, the annotations on the left side are not supposed to be annotated with Exonerate (and one of the 4 clusters will be deleted in the revised sequence, as one of the "large" deletions). However, this is a positive control test with the existing annotation.

ADD COMMENT • link 3.5 years ago by Charles Warden 8.2k

0

Entering edit mode

Also, to provide a bit more justification for this strategy (even though it still isn't perfect and I would encourage additional feedback), I thought it might help to add some screenshots:

A couple examples for one class of gene:

YF example 1

YF example 2

An example for a different class of gene: YLec example 1

In other words, I would say you can see the individual exons are a better fit for Exonerate (using --bestn 1 --showtargetgff true --model protein2genome --showalignment no --percent 97 --maxintron 2000) than MAKER (at least with the settings that I used) or GENSCAN. For example, the exons for GENSCAN were different for all 3 of those examples, and the exons were either different or completely different for MAKER. There is an extra Exonerate exon/gene in the 3rd example, and that should be filtered to leave the overlapping annotation that is a better fit for the current annotation (on this positive control sequence, before revisions).

The GffCompare statistics also match that conclusion.

ADD REPLY • link 3.5 years ago by Charles Warden 8.2k

score 1 · Accepted Answer · 2020-11-19

I have also written some code to map coordinates based upon a .pileup file created from a BWA-Alignment (of the new sequence to the old sequence).

I don't think this will work if there are any inversions or complex structural variants. However, I think this may be OK if you just need to shift things due to indels (and you then have a way to check the resulting exons define valid genes).