Hello BioStars,
A fun learning opportunity was presented to me by an obstacle in the road to completing my transcriptomics project. The underlying question is: How do you EFFICIENTLY find the longest sequence for a large set of orthologous gene family groups (Orthogroups)?
Two disclaimers: 1.) I am a coding monolinguist (for now, just py!) 2.) I have a little computer science background
To start off, after running OrthoFinder to cluster a reference transcriptome into Orthogroups, I have a couple files important to this task.
orthogroups.csv - Column 0 has the name of Orthogroup, Column 1 has a list of transcripts from the reference transcriptome that have clustered together to form that orthogroup. There are roughly 64,000 orthogroups.
reference_transcriptome.fasta - Roughly 439,000 fasta sequences from several related species.
FIRST APPROACH: For each orthogroup in orthogroups.csv; I got a list of transcripts that needed to be compared. For each transcript; I went into reference_transcriptome.fasta, found the sequence For that sequence; I measured the length using Seq.IO from biopython and stored these for comparison. Print name of orthogroup and longest transcript found by comparing stored sequence/seq length pairs
After watching this brute force approach go for a day or two with little progress I rethought the problem...
SECOND APPROACH: 1.) Using SeqIO from Biopython, I created a table with the name of each header in reference_transcriptome.fasta and it's sequence length.
2.) For each Orthogroup
longestseq_len = 0
longestseq_name = ""
for each transcript;
access transcript header in table
if length of transcript >= longestseq_len;
length = longestseq_len
header = longestseq_name
print name_of_orthogroup + longestseq_name
This second approach works in a "reasonable" amount of time, but I'd bet that there is a method out there that can accomplish this in minutes if not seconds.
If anybody has insight about how to do this faster, please let me know!
Additionally, any tips (for posting, communicating code ideas, comments about absent necessary info/ present unnecessary info) would be greatly appreciated.
Thanks, Alex