Orthogroups.csv file for OrthoFinder
16 months ago

Dear all,

I am trying to interpret the OrthoFinder output file Orthogroups.csv. I ran OrthoFinder with three input protein FASTA files, and the output Orthogroups.csv looks like the one below. The first two species have no reference genome, so their IDs look like "Trinity_DN_...". Since the IDs have the same format for the first two species (Clarias, Pan), how can I tell whether a given ID comes from the first species (Clarias) or the second species (Pan)?

If the IDs used in each set are not unique, you will likely run into trouble (I'm already surprised that BLAST did not complain about this). Before running OrthoFinder, it's a good idea to prefix the IDs from each set with a 'code' that indicates the species they come from.
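A minimal sketch of that prefixing step, assuming plain FASTA input where every header line starts with ">". The function name `prefix_fasta_ids`, the underscore separator, and the example Trinity-style IDs are hypothetical and only for illustration:

```python
def prefix_fasta_ids(lines, code):
    """Return FASTA lines with `code_` prepended to every header ID.

    `lines` is an iterable of FASTA lines without trailing newlines.
    Sequence lines are passed through unchanged.
    """
    out = []
    for line in lines:
        if line.startswith(">"):
            # Keep the rest of the header (ID and description) as it was.
            out.append(">" + code + "_" + line[1:])
        else:
            out.append(line)
    return out


# Hypothetical records; real Trinity IDs will differ.
example = [
    ">TRINITY_DN100_c0_g1_i1 len=300",
    "MKVLA",
    ">TRINITY_DN101_c0_g1_i1",
    "MAAGT",
]
prefixed = prefix_fasta_ids(example, "Clarias")
print("\n".join(prefixed))
```

Running the same function with a different code (for example "Pan") on the second file keeps the two sets of IDs distinguishable wherever they later appear.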

I think OrthoFinder does the conversion for you before running BLAST. For example, in the WorkingDirectory I got:

$ head SpeciesIDs.txt SequenceIDs.txt
==> SpeciesIDs.txt <==
0: Athaliana.fasta
1: Bdistachyon.fasta
2: Hvulgare.fasta
3: Osativa.fasta
4: Pglaucum.fasta
5: Sbicolor.fasta
6: Sitalica.fasta
7: Zmays.fasta

==> SequenceIDs.txt <==
0_0: AT1G50920.1 | Symbols: | Nucleolar GTP-binding protein | chr1:18870555-18872570 FORWARD LENGTH=671
0_1: AT1G36960.1 | Symbols: | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT1G48095.1); Has 54 Blast hits to 54 proteins in 2 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 54; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink). | chr1:14014796-14015508 FORWARD LENGTH=181
0_2: AT1G44020.1 | Symbols: | Cysteine/Histidine-rich C1 domain family protein | chr1:16716692-16718656 REVERSE LENGTH=577
0_3: AT1G15970.1 | Symbols: | DNA glycosylase superfamily protein | chr1:5486544-5488494 REVERSE LENGTH=352
0_4: AT1G73440.1 | Symbols: | calmodulin-related | chr1:27611418-27612182 FORWARD LENGTH=254
0_5: AT1G75120.1 | Symbols: RRA1 | Nucleotide-diphospho-sugar transferase family protein | chr1:28197022-28198656 REVERSE LENGTH=402
0_6: AT1G17600.1 | Symbols: | Disease resistance protein (TIR-NBS-LRR class) family | chr1:6053026-6056572 REVERSE LENGTH=1049
0_7: AT1G51380.1 | Symbols: | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392
0_8: AT1G77370.1 | Symbols: | Glutaredoxin family protein | chr1:29073916-29074642 FORWARD LENGTH=130
0_9: AT1G44090.1 | Symbols: ATGA20OX5, GA20OX5 | gibberellin 20-oxidase 5 | chr1:16760677-16762486 REVERSE LENGTH=385

$ grep '^>' Species0.fa | head
>0_0
>0_1
>0_2
>0_3
>0_4
>0_5
>0_6
>0_7
>0_8
>0_9
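To illustrate how those internal IDs tie back to a species, here is a rough sketch that reads lines in the "key: value" layout shown above and uses the part before the underscore in a sequence ID (e.g. "0" in "0_3") as the species index. The function names `parse_id_file` and `species_of` are made up for this example, and the layout is assumed from the listing above rather than from any format specification:

```python
def parse_id_file(lines):
    """Parse 'key: value' lines (as in SpeciesIDs.txt / SequenceIDs.txt)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first ': ' only; descriptions may contain more colons.
        key, _, value = line.partition(": ")
        mapping[key] = value
    return mapping


def species_of(internal_seq_id, species_ids):
    """Return the input file name for an internal ID such as '0_3'."""
    species_index = internal_seq_id.split("_", 1)[0]
    return species_ids[species_index]


# Lines copied from the listing above.
species_ids = parse_id_file(["0: Athaliana.fasta", "1: Bdistachyon.fasta"])
sequence_ids = parse_id_file([
    "0_0: AT1G50920.1 | Symbols: | Nucleolar GTP-binding protein | "
    "chr1:18870555-18872570 FORWARD LENGTH=671",
])

print(species_of("0_0", species_ids))
print(sequence_ids["0_0"].split(" | ")[0])
```

With real files, the lists would be replaced by the lines read from SpeciesIDs.txt and SequenceIDs.txt in the WorkingDirectory.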

16 months ago
SMK ♦ 1.3k
Ghent, Belgium

In newer versions of OrthoFinder (2.3.1, for example), several output files are tab-delimited, and their file extensions were changed to .tsv to match.

In the output file Orthogroups.tsv, the members of each family that come from different input sequence files are separated by tabs:

    Athaliana   Hvulgare    Osativa Pglaucum    Sbicolor    Sitalica    Zmays
OG0010401   AT1G09410.1, AT1G56690.1    HORVU4Hr1G052340.1  LOC_Os03g20190.1    Pgl_GLEAN_10026176      Seita.9G424600.1.p  Zm00001d028935_P001


With the newer version, the members from your first two species (Clarias, Pan) will be separated by a tab and appear in the second and third columns of "Orthogroups.tsv", so you can identify them by selecting the corresponding column, regardless of how the IDs are named.
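As a rough sketch of that column-based lookup, the snippet below parses tab-delimited text with one column per species and comma-separated genes inside each cell. The function name `genes_by_species`, the species labels, and the Trinity-style IDs in the sample are hypothetical; only the tab/comma layout is taken from this thread:

```python
import csv
import io


def genes_by_species(tsv_text):
    """Map orthogroup -> {species column name -> list of gene IDs}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    species = header[1:]  # first column holds the orthogroup ID
    table = {}
    for row in reader:
        if not row:
            continue
        orthogroup, cells = row[0], row[1:]
        table[orthogroup] = {
            # Genes of one species are comma-separated within the cell.
            sp: [g.strip() for g in cell.split(",") if g.strip()]
            for sp, cell in zip(species, cells)
        }
    return table


# Hypothetical sample: the same ID occurs in two species, but the
# column it sits in still tells them apart.
sample = (
    "\tClarias\tPan\tThird\n"
    "OG0000001\tTRINITY_DN1_c0_g1_i1, TRINITY_DN7_c0_g1_i1\t"
    "TRINITY_DN1_c0_g1_i1\t\n"
)
table = genes_by_species(sample)
print(table["OG0000001"]["Clarias"])
print(table["OG0000001"]["Pan"])
```

For a real run, `sample` would be replaced by the contents of the Orthogroups file, and the column names would be whatever appears in its header row.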

16 months ago
david_emms • 50

Hi

Just following on from what SMK said, the Orthogroups.csv file was also a tab-delimited file. Genes from different species are separated by a tab, and genes within the same species are separated by a comma. If you open it in a spreadsheet program (e.g. Excel, LibreOffice Calc) and choose 'tab' as the delimiter, it will display correctly.

All the best, David