How to order a gff3 file by coordinates
6.2 years ago

I have discovered that the gene, mRNA and CDS features in my gff3 file are not sorted by coordinate. An example:

 LG1     phytozomev10    gene    10835748        10846741        .        -       .       ID=gene00257-v1.0-hybrid.v1.1;Name=gene00257-v1.0-hybrid
    LG1     phytozomev10    mRNA    10835748        10846741        .       -       .       ID=mrna00257.1-v1.0-hybrid.v1.1;Name=mrna00257.1-v1.0-hybrid;pacid=27244575;longest=1;Parent=gene00257-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     10846566        10846741        .       -       2       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10841035        10841272        .       -       0       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10840828        10840916        .       -       2       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.3;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10839072        10839109        .       -       0       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.4;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10837291        10838461        .       -       1       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.5;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10836538        10836623        .       -       0       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.6;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10835748        10835776        .       -       1       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.7;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    gene    10862624        10876720        .       +       .       ID=gene00258-v1.0-hybrid.v1.1;Name=gene00258-v1.0-hybrid
    LG1     phytozomev10    mRNA    10862624        10876720        .       +       .       ID=mrna00258.1-v1.0-hybrid.v1.1;Name=mrna00258.1-v1.0-hybrid;pacid=27244449;longest=1;Parent=gene00258-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     10862624        10862667        .       +       0       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10862746        10863050        .       +       1       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10863146        10863223        .       +       2       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.3;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10864316        10864463        .       +       2       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.4;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10864616        10864850        .       +       1       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.5;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449

antonio@PC-DESPACHO:/mnt/d/Dropbox/Fvesca Anotacion/annotation$ cat Fvesca_226_v1.1.gene.gff3 | grep "LG1" | tail

    LG1     phytozomev10    CDS     8224242 8224301 .       +       0       ID=mrna35195.1-v1.0-hybrid.v1.1.CDS.5;Parent=mrna35195.1-v1.0-hybrid.v1.1;pacid=27245751
    LG1     phytozomev10    gene    8551588 8551849 .       -       .       ID=gene35196-v1.0-hybrid.v1.1;Name=gene35196-v1.0-hybrid
    LG1     phytozomev10    mRNA    8551588 8551849 .       -       .       ID=mrna35196.1-v1.0-hybrid.v1.1;Name=mrna35196.1-v1.0-hybrid;pacid=27245617;longest=1;Parent=gene35196-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     8551817 8551849 .       -       0       ID=mrna35196.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna35196.1-v1.0-hybrid.v1.1;pacid=27245617
    LG1     phytozomev10    CDS     8551588 8551713 .       -       0       ID=mrna35196.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna35196.1-v1.0-hybrid.v1.1;pacid=27245617
    LG1     phytozomev10    gene    8554333 8555118 .       +       .       ID=gene35197-v1.0-hybrid.v1.1;Name=gene35197-v1.0-hybrid
    LG1     phytozomev10    mRNA    8554333 8555118 .       +       .       ID=mrna35197.1-v1.0-hybrid.v1.1;Name=mrna35197.1-v1.0-hybrid;pacid=27245093;longest=1;Parent=gene35197-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     8554333 8554716 .       +       0       ID=mrna35197.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna35197.1-v1.0-hybrid.v1.1;pacid=27245093
    LG1     phytozomev10    CDS     8554791 8554854 .       +       0       ID=mrna35197.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna35197.1-v1.0-hybrid.v1.1;pacid=27245093
    LG1     phytozomev10    CDS     8554946 8555118 .       +       2       ID=mrna35197.1-v1.0-hybrid.v1.1.CDS.3;Parent=mrna35197.1-v1.0-hybrid.v1.1;pacid=27245093

As you can see, the first gene/mRNA starts at coordinate 10835748 and contains a number of exons. But the tail of the same gff3 file shows features of another gene/mRNA starting at 8224242, a much lower coordinate. In other words, the whole gff3 file is out of order.

I would like to sort the gff3 by coordinates while keeping each gene and mRNA line together with its corresponding CDS (exon) lines. Which gene each CDS belongs to is recorded in the CDS lines themselves (the Parent attribute).

gff3 ordering • 6.4k views
5.7 years ago
cmdcolin ★ 3.8k

There is the GFF3Sort tool from https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1930-3

This tool performs a "topological sort", which places each gene above its mRNA, and each mRNA above its subfeatures.

There is also GenomeTools, which can do

gt gff3 -sortlines yourfile.gff > yourfile.sorted.gff

This does not guarantee a topological sort.

Then there is the GNU sort approach mentioned by lieven.sterck, but it is not intrinsically topological either, unless extra options such as sorting on column 3 happen to produce that order.

All three methods should be compatible with tabix, though; tabix doesn't require a topological sort.
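
To see whether a sorted file happens to be topological, one option is to check that every Parent ID has already appeared as an ID on an earlier line. The following is only a minimal sketch, not part of any of the tools above; the function names parse_attributes and find_orphans are made up for illustration, and it assumes tab-separated GFF3 lines with ID and Parent attributes in column 9.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def find_orphans(lines):
    """Return (line_number, parent_id) pairs for children seen before their parent."""
    seen_ids = set()
    orphans = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Skip blank lines, comments and directives such as ##gff-version.
        if not stripped or stripped.startswith("#"):
            continue
        columns = stripped.split("\t")
        if len(columns) < 9:
            continue
        attrs = parse_attributes(columns[8])
        # A Parent attribute may list several comma-separated IDs.
        for parent in filter(None, attrs.get("Parent", "").split(",")):
            if parent not in seen_ids:
                orphans.append((number, parent))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return orphans
```

An empty result from find_orphans(open("yourfile.sorted.gff")) would mean that every parent precedes its children in that file.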


The GenomeTools command gt gff3 -sort yourfile.gff > yourfile.sorted.gff is also available and uses a slightly different algorithm than -sortlines. It could be of interest.

6.2 years ago

Depending on the keys that are present in your gff file, this one-liner will get you a long way:

sort -k1,1V -k4,4n -k5,5rn -k3,3r some.gff > some.sorted.gff

It might still need a little tweaking, though.
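
One possible tweak concerns the last key: a reverse alphabetical sort on column 3 places an mRNA line before a gene line when both have the same start and end. The sketch below mirrors the same keys in Python (sequence name in natural order, start ascending, end descending) but replaces the alphabetical tie-break with an explicit feature rank. The names FEATURE_RANK, natural_key, sort_key and sort_gff3 are hypothetical, the rank table is an assumption about which feature types the file contains, and moving all "#" lines to the top is a simplification that would not suit files with "###" separators or an embedded FASTA section.

```python
import re

# Assumed hierarchy; unknown feature types sort after these.
FEATURE_RANK = {"gene": 0, "mRNA": 1, "exon": 2, "CDS": 3}


def natural_key(seqid):
    """Split 'LG10' into ['LG', 10, ''] so that LG2 sorts before LG10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", seqid)]


def sort_key(line):
    """Key: sequence name, start ascending, end descending, feature rank."""
    cols = line.strip().split("\t")
    seqid, ftype = cols[0], cols[2]
    start, end = int(cols[3]), int(cols[4])
    rank = FEATURE_RANK.get(ftype, len(FEATURE_RANK))
    return (natural_key(seqid), start, -end, rank)


def sort_gff3(lines):
    """Return comment/directive lines first, then features in sorted order."""
    headers = [l for l in lines if l.lstrip().startswith("#")]
    features = [l for l in lines
                if l.strip() and not l.lstrip().startswith("#")]
    return headers + sorted(features, key=sort_key)
```

Like the one-liner, this is a plain coordinate sort: it tends to put a gene above its mRNA and CDS lines because parents span their children, but it does not follow Parent links and so is not a guaranteed topological sort.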

7 months ago
alejandrogzi ▴ 120

Hi! I recently developed this tool: gtfsort, a chr/pos/feature sorter for GTF 2.5 and GTF 3 that uses a lexicographically-based index ordering algorithm. It currently accepts only the GTF 2.5 and 3 formats, but you can easily tweak the code to match your features.
