Pan and core Genome prediction from OrthoMCL output
1
0
Entering edit mode
7.3 years ago
Eva_Maria ▴ 190

Hi biostars

I did Orthomcl clustering with 7 genome and I got more than 9000 clusters. How will I find Pan,core Genome and unique genes from OrthoMCL output.

Is there any scripts or tool for this?

core-genome orthomcl mcl pangenome • 3.2k views
ADD COMMENT
2
Entering edit mode
7.3 years ago
Joe 21k

Roary basically does all this for you these days... however I did some OrthoMCL work a while back that I recently revisited to find core/soft core/ and accessory genomes.

Here's a script that might help. It finds clusters which are core only (i.e. a cluster that contains only those lines which have a single ortholog from each of you genomes (7 in your case)). NB, this ignores paralogs. In my project, a core gene had to appear 100% of the time. Larger scale pangenome studies will sometimes let you specify that it has to be present in (e.g) 95% of genomes - but I'd argue that is meaningless on low numbers of genomes.

Once you have those which are core, it has an inverse operation where you can provide it a list of core orthologs to ignore, and by definition, that's you accessory genome. Fair warning though, the script wasn't designed to be all-purpose :P so don't just assume its working perfectly, I wrote it peicewise as I was working through my own project in the way that seemed easiest to me at the time!

EDIT This wont pick out unique genes for you, but that's easy, you'd just find all the clusters which have only 1 ortholog (and orthoMCL inherently dumps them at the end of the file anyway).

It requires you to create some helper files though. First you'll need a "keyfile" which contains the orthoMCL compliant names you have used. I.e. given clusters that look like this:

OG_3587: AG49|04470 AG51|00275 AGBD|01460 AGHT|00124 AGJN|03725 AGKC|03784 AGNP|02778 AGTI|01216 LG01|04968 LG33|00741 LG45|02256
OG_3588: AG49|04471 AG51|00276 AGBD|01461 AGHT|00125 AGJN|03726 AGKC|03785 AGNP|02779 AGTI|01217 LG01|04969 LG33|00742 LG45|02257

My "CoreKeyFile.txt" looks like this:

AG49
AG51
AGBD
AGHT
AGJN
AGKC
AGNP
AGTI
LG01
LG33
LG45

You can then invoke the script with python coregeneparser.py -i [clusterfile] -n [keyfile]

#!/usr/bin/python

# Script to parse orthoMCL/orthAgogue gene clusters and extract just those that are core

import sys
import argparse
import traceback
import warnings

##################

# Get a list of strings to iterate over
def getKeys(nameFile):
        with open(nameFile, "r") as namehandle:
                names = []
                for line in namehandle:
                        strip = line.rstrip('\n')
                        names.append(strip)

        return names

#################
def main():
###################################################################################################
# Parse command line arguments
        try:
                parser = argparse.ArgumentParser(description='Script pulls out core genes from OrthoMCL outputs. i.e. it finds lines where each identifier is found once only.')
                parser.add_argument(
                        '-i',
                        '--infile',
                        action='store',
                        help='The gene cluster file to parse')
                parser.add_argument(
                        '-n',
                        '--names',
                        action='store',
                        help='A file of compliant "keys" which correspond to the genomes from which the clusters are derived.')
                parser.add_argument(
                        '-d',
                        '--disregard',
                        action='store',
                        help='A second file of keys to specifically disregard. Lines containing these will be thrown out of the matched list.')

                args = parser.parse_args()
        except:
                print "An exception occured with argument parsing. Check your provided options."
                traceback.print_exc()

        nameFile = args.names
        ignoreFile = args.disregard
# Main code:
        keysToFind = getKeys(nameFile)
        if ignoreFile is not None:
                keysToDisregard = getKeys(ignoreFile)

        print('Looking for:')
        print(keysToFind)
        if ignoreFile is not None:
                print('and ignoring:')
                print(keysToDisregard)

        matchedLines = []
        with open(args.infile, "r") as clusterFile:
                matchedLines = [line for line in clusterFile if all([line.count(keyToFind) == 1 for keyToFind in keysToFind])]
                if ignoreFile is not None:
                        matchedLines = [line for line in matchedLines if all([line.count(keyToDisregard) == 0 for keyToDisregard in keysToDisregard])]

        for each in matchedLines:
                print each.rstrip('\n')

if __name__ == "__main__":
        main()
ADD COMMENT

Login before adding your answer.

Traffic: 1540 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6