Hi all,
I am trying to identify a gene of known sequence among a set of coding sequences predicted with Augustus. I have two candidate genes that lie on opposite strands and have identical sequences once I reverse complement one of them (100% identity). I don't quite understand the concept of overlapping genes and forward/reverse strands: if my candidate genes are reverse complementary to each other and are on opposite strands, could they in fact be one gene, with the Augustus predictions for the forward and complementary strands simply overlapping?
Thanks a lot, Agata
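For anyone wanting to reproduce the check described above, here is a minimal sketch of testing whether one sequence is the exact reverse complement of another. The function names and the toy sequences are made up for illustration; they are not from Augustus or from the actual data.

```python
# Translation table for DNA bases (N is kept as N); names here are illustrative.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_reverse_complement(seq_a: str, seq_b: str) -> bool:
    """True if seq_b is exactly the reverse complement of seq_a (case-insensitive)."""
    return seq_a.upper() == reverse_complement(seq_b).upper()


if __name__ == "__main__":
    # Toy sequences, not the real candidate genes.
    forward = "ATGCCCTGA"
    reverse = reverse_complement(forward)
    print(reverse)
    print(is_reverse_complement(forward, reverse))
```

An exact match here only says the two sequences are identical after reverse complementing; it does not by itself say whether they come from one locus or two.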
Just to clarify, these two predictions are on different strands in two separate locations?
Yes, they are within the same scaffold. To give you an overview:
g1 start:719963 end:721093 strand:-
g2 start:535611 end:536738 strand:+
Moreover, I am investigating a variant in the gene of interest. In my alignment visualisation there is a variant at corresponding positions in both genes (it lies at the same distance from the start of one gene as from the end of the other), and the two variants are complementary to each other. Would this also support the theory that it's the same gene?
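The "same distance from the start of one as from the end of the other" observation can be written down as a small check. This is only a sketch: the function names are hypothetical, offsets are 0-based, and it assumes both sequences have the same length. Note that the coordinates quoted above give inclusive spans of 1131 and 1128 bases, so with the real data the offsets would have to be taken within the aligned portion.

```python
# Complement lookup for single bases; names in this sketch are illustrative.
_BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def mirrored_offset(offset: int, length: int) -> int:
    """Map a 0-based offset in a sequence to the matching 0-based offset
    in its reverse complement (both of the given length)."""
    if not 0 <= offset < length:
        raise ValueError("offset must lie within the sequence")
    return length - 1 - offset


def variants_mirror_each_other(length: int,
                               offset_a: int, allele_a: str,
                               offset_b: int, allele_b: str) -> bool:
    """True if variant B sits at the mirrored position of variant A and
    carries the complementary base."""
    same_site = offset_b == mirrored_offset(offset_a, length)
    complementary = _BASE_COMPLEMENT[allele_a.upper()] == allele_b.upper()
    return same_site and complementary


if __name__ == "__main__":
    # Toy example: a 6-base sequence with a T at offset 1 on one strand
    # corresponds to an A at offset 4 on the reverse complement.
    print(mirrored_offset(1, 6))
    print(variants_mirror_each_other(6, 1, "T", 4, "A"))
```

If the check passes, the two calls describe the same change seen from opposite strands, which is what one would expect both for a single locus and for two identical copies that reads cannot be told apart on.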
The chances of there being two copies of the gene seem small, since they share a variation at the same relative position while being physically apart. Is this a novel genome? Could the assembly be incorrect? Is the region around the genes similar?
Yes, the scaffolds were assembled, but the genome is highly repetitive and the assembly is quite challenging. So would you suggest that this is more likely a result of assembly issues?
In fact, BLAST also showed two short 'genes' (228b) close to the ones with the variation; they also have very high similarity, but to the other end of the query genes. I extracted the whole regions containing the short gene, the long gene and the intergenic region between them for both the forward- and reverse-strand genes, and those regions are also 100% identical, including the intergenic region. Is it possible that this is a prediction error and that it is in fact one long gene, with a gap in the prediction caused by the poor assembly?
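A rough sketch of the region comparison described above, for anyone who wants to repeat it on wider flanks: pull each region out of the scaffold by its coordinates, reverse complement the one on the minus strand, and count matching positions. It assumes 1-based inclusive coordinates (as in GFF output) and a scaffold already loaded as a plain string; the helper names and the toy scaffold are invented for the example.

```python
# Uppercase-only complement table; sequences are upper-cased before use.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def extract_region(scaffold: str, start: int, end: int, strand: str = "+") -> str:
    """Return scaffold[start..end] (1-based, inclusive), reverse complemented
    when strand is '-'."""
    if start < 1 or end > len(scaffold) or start > end:
        raise ValueError("coordinates fall outside the scaffold")
    region = scaffold[start - 1:end].upper()
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Ungapped, position-by-position identity of two equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Toy scaffold: "ATGCCC" at 3-8 and its reverse complement "GGGCAT" at 12-17.
    scaffold = "AAATGCCCTTTGGGCATAA"
    plus_region = extract_region(scaffold, 3, 8, "+")
    minus_region = extract_region(scaffold, 12, 17, "-")
    print(plus_region, minus_region, percent_identity(plus_region, minus_region))
```

Because this compares position by position without gaps, it only suits regions of equal length; for anything with indels a proper aligner such as BLAST is the better tool. Extending the window outwards until identity drops would show how far the identical stretch runs.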