Question

orthofinder software results

4.9 years ago

mxlsherry1992 ▴ 80

Dear all, I am a bit confused by my OrthoFinder results. I have RNA-seq data for three species, A, B and C (A has a reference genome; B and C do not). My aim is to find genes that exist in B and C but are absent in A. I think OrthoFinder should work for this. After the OrthoFinder analysis I got several output files, one of which is called "Orthogroups.GeneCount.csv". I am not sure whether this is the one I need. The file looks like this:

             A  B   C   Total
OG0000000   28  30  13  71
OG0000001   26  31  1   58
OG0000002   6   49  0   55
OG0000003   13  40  0   53
OG0000004   18  16  18  52
OG0000005   29  19  4   52
OG0000006   18  33  0   51
OG0000007   4   46  0   50
OG0000008   28  18  4   50

I assume that OG0000002, OG0000003, OG0000006 and OG0000007 are the genes I need? (But I am not confident whether I am right or not.)

There is also another file called "Orthogroups.csv". Does it give the correspondence between the OG0000000-style numbers and the IDs in the original input protein files?

Or is there another OrthoFinder output file I should look at? (There are a bunch of output files.)

Thanks in advance for your suggestions and have a great day!!

RNA-Seq Assembly rna-seq next-gen • 2.0k views

ADD COMMENT • link updated 4.9 years ago by lieven.sterck 15k • written 4.9 years ago by mxlsherry1992 ▴ 80

score 0 · Answer 1 · 2019-05-15

4.9 years ago

lieven.sterck 15k

Not exactly.

The OG<number> is the ID of a gene family (ortho-cluster). For each of those you can look up in the Orthogroups.csv file the exact gene IDs of the genes in that specific group/cluster.

The Orthogroups.GeneCount.csv file simply summarizes how many genes from each species are present in each cluster.
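To illustrate that lookup, here is a minimal Python sketch. It assumes Orthogroups.csv is tab-delimited with one column per species and comma-separated gene IDs in each cell, so check this against your own file. The gene IDs in `SAMPLE` and the function name `genes_in_orthogroup` are made up for illustration and are not part of OrthoFinder.

```python
import csv
import io

# Hypothetical miniature Orthogroups.csv (assumed tab-delimited).
# The gene IDs are invented for illustration only.
SAMPLE = (
    "\tA\tB\tC\n"
    "OG0000002\tA_g1, A_g2\tB_t1, B_t2, B_t3\t\n"
    "OG0000004\tA_g7\tB_t9\tC_t4, C_t5\n"
)

def genes_in_orthogroup(handle, og_id):
    """Return {species: [gene IDs]} for one orthogroup, or None if not found."""
    reader = csv.reader(handle, delimiter="\t")
    species = next(reader)[1:]  # header row: empty cell, then species names
    for row in reader:
        if row and row[0] == og_id:
            # Pad short rows so every species gets a (possibly empty) cell.
            cells = row[1:] + [""] * (len(species) - len(row) + 1)
            return {
                sp: [g.strip() for g in cell.split(",") if g.strip()]
                for sp, cell in zip(species, cells)
            }
    return None

members = genes_in_orthogroup(io.StringIO(SAMPLE), "OG0000002")
print(members)
```

With a real file you would pass `open("Orthogroups.csv", newline="")` instead of the `io.StringIO` sample.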
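Since the stated aim is "present in B and C, absent in A", the rows of interest in this count table would be those with 0 in column A and a non-zero count in B and C. A rough Python sketch of that filter follows, using a few rows copied from the question. It assumes the file is tab-delimited; the function name `absent_in` is hypothetical.

```python
import csv
import io

# A few rows copied from the question; the tab-delimited layout is an assumption.
SAMPLE = (
    "\tA\tB\tC\tTotal\n"
    "OG0000000\t28\t30\t13\t71\n"
    "OG0000002\t6\t49\t0\t55\n"
    "OG0000004\t18\t16\t18\t52\n"
)

def absent_in(handle, absent, present):
    """Orthogroup IDs with 0 genes in `absent` and at least 1 gene in every species of `present`."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    col = {name.strip(): i for i, name in enumerate(header)}
    hits = []
    for row in reader:
        if not row:
            continue
        if int(row[col[absent]]) == 0 and all(int(row[col[s]]) > 0 for s in present):
            hits.append(row[0])
    return hits

# Orthogroups found in B and C but not in A (none in this small sample).
print(absent_in(io.StringIO(SAMPLE), absent="A", present=["B", "C"]))
# For contrast: orthogroups with no genes in C, such as OG0000002.
print(absent_in(io.StringIO(SAMPLE), absent="C", present=["A", "B"]))
```

Note that in the rows shown in the question, OG0000002, OG0000003, OG0000006 and OG0000007 have a zero in column C, not in column A, so they are groups missing from C rather than from A.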

All this is actually nicely explained on their webpage though (hint ;-) )

ADD COMMENT • link 4.9 years ago by lieven.sterck 15k

Thanks for your kind reply! I think I made a mistake with the OrthoFinder input files, which results in too many sequences in the output. For the species without a reference genome, I just used cd-hit and TransDecoder to get the protein files. Do I also need to use "get_longest_isoform_seq_per_trinity_gene.pl" to get the unigenes? (I am not sure whether cd-hit and get_longest_isoform_seq_per_trinity_gene.pl are both necessary to get unigenes, or whether I need only one of them.) I would really appreciate any suggestions on that.

Thank you!!

ADD REPLY • link 4.9 years ago by mxlsherry1992 ▴ 80

cd-hit will resolve some of the redundancy but will likely not have much effect, as what Trinity already outputs should be more or less non-redundant (at least at the technical/sequence level). Running the cd-hit equivalent for proteins might help a little as well.

Getting one isoform per 'locus' will indeed help, so running this Perl script (I don't know it, to be honest) could do the trick.
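To make "one isoform per locus" concrete, here is a small Python sketch that keeps the longest transcript per Trinity gene. It is not the Trinity Perl script itself, only a stand-in for the idea. It assumes the usual Trinity header pattern, where the gene ID is the transcript ID with the trailing `_i<number>` removed; the sequences and the function names are made up for illustration.

```python
import re

# Made-up records; headers follow the usual TRINITY_DN<n>_c<n>_g<n>_i<n> pattern.
SAMPLE_FASTA = """>TRINITY_DN100_c0_g1_i1 len=8
ATGCATGC
>TRINITY_DN100_c0_g1_i2 len=12
ATGCATGCATGC
>TRINITY_DN200_c0_g1_i1 len=6
ATGCAT
"""

def read_fasta(text):
    """Yield (id, sequence) pairs; the id is the first word of the header line."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def longest_isoform_per_gene(records):
    """Map gene ID -> (transcript ID, sequence) of its longest isoform."""
    best = {}
    for name, seq in records:
        gene = re.sub(r"_i\d+$", "", name)  # strip the isoform suffix
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best

picked = longest_isoform_per_gene(read_fasta(SAMPLE_FASTA))
for gene, (name, seq) in picked.items():
    print(gene, name, len(seq))
```

If the filtering is done on TransDecoder peptides rather than on transcripts, the IDs will likely carry an extra suffix, so the regular expression would need adjusting.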

Keep in mind though that comparing transcriptome data with genomic data is always tricky and can result in 'strange' results due to the inherent nature of those data types.

ADD REPLY • link 4.9 years ago by lieven.sterck 15k