Question

How to verify the information stored in a GFF file

1

Entering edit mode

9.5 years ago

iraun 6.2k

Hi all,

I've just created a new GFF off a new version of a genome X using liftOver tool. I would like to know any method to verify the content stored in this new GFF. I mean, I'm looking for a way to compare the old GFF information with the new gFF one. For example:

- See if the genes have changed.
- See if the coordinates of a old transcript have changed (due to indels in my new genome). Check that the new coordinates correspond actually with the old transcript (same fasta sequence).
...

I'm thinking about extracting transcript sequences from both GFF (old and new), and blast them against each other to see the length of the alignment, see the differences between both versions...etc.

If anyone has done this kind of analysis (or if anyone has any idea about this), I would like to hear new opinions/suggestions.

Thanks :)

P.S: I don't want to validate the GFF format, I want to check the information (content) of the new annotation in comparison to the old one.

GFF Genome Annotation • 4.3k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by iraun 6.2k

Ram · Answer 1 · 2014-10-29

1

Entering edit mode

9.5 years ago

Cytosine ▴ 460

I'd suggest to just use "diff", for comparing the coordinates in the 2 files.

You can go with something like:

diff $GFF1 $GFF2 --suppress-common-lines --side-by-side

That will get you a nicely formatted input of differing lines. If you get no output then the files are similar.

For comparing sequences, your idea of using blast seems reasonable.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Cytosine ▴ 460

0

Entering edit mode

Thanks Cytosine. Yes, I know diff command, and you're right, I can use it to see differences between files. But, I would like to find a way to check that those differences, are really differences of my new genome reference. That's the reason of doing blast.. or another kind of approach that I would like to know :)

ADD REPLY • link 9.5 years ago by iraun 6.2k

Ram · Answer 2 · 2014-11-18

The ParsEval program from the AEGeAn Toolkit was designed to be a kind of "diff" for GFF3 files. It has both text and HTML/PNG output modes, the later of which has genome-browser-style graphics showing the gene structures of both GFF3 files.

However, ParsEval expects the reference sequences for both files to be identical. So if you're looking at annotations for two different versions of an assembly (seems to be the case since you mentioned liftOver) then it may not be the appropriate tool.

score 0 · Answer 3 · 2016-08-02

You can use gffread to extract, in fasta format, the CDS and/or mRNA sequences under the old and new gene models.

The fastas can then be profitably be compared with diff (after sorting, using, say, fasort).

They _should_ be the same.

Except....

liftOver might generate ill-formed gff however. As the liftOver program usage statement says:

-gff File is in gff/gtf format. Note that the gff lines are converted separately. It would be good to have a separate check after this that the lines that make up a gene model still make a plausible gene after liftOver

In my experience, you will get un-plausable gff3 output if, for example, the first and last exon liftOver to the same scaffold, but an interior exon liftsOver to some other scaffold. There are probably other ways of generating unplausible gff output.

You might choose to follow this sage advice and first convert gene model from gff3 to genePred format (say, using gff3ToGenePred), and liftOver the genePred instead. This will "ensure your genes are lifted over as a unit."

If you are intent upon lifting over your gff3 (since, say, it has column 9 attributes you wish to preserve), you can first liftOver the genePred and then filter out the failed gene models from the gff3.

This strategy is implemented in the following bash function:

 function liftOverGff3Lint () { 
    local old=$1
    local chain=$2
    local new=$3
    local unMapped=$4
    gff3ToGenePred ${old} ${old}.genePred
    liftOver -genePred ${old}.genePred ${chain} ${new}.genePred ${unMapped}.genePred
    liftOver -gff <(grep -v -f <(grep -v '^#' ${unMapped}.genePred | cut -f 1 | perl -p -e '$_="ID=${_}PARENT=${_}"' | tee ${unMapped}.genePred.id ) ${old}) ${chain} ${new} ${unMapped}
 }

In my experience, after removing unplausible gff as above, the remaining gene models produce identical .fasta files when generated as suggested above.