Question

Please Help Me To Write This Python Script To Find The Number Of Homologous Gene Groups Within A Genome.

0

Entering edit mode

10.5 years ago

armahmud3 • 0

Dear all, please kindly help me to solve this problem by python script writing. I am a beginner in script writing.

Problem: A homologous gene group consists of genes, which have an alignment to at least one other gene in the group with both a % sequence coverage and % identity equal to or higher than a given stringency cutoff. Write a python program to find the number of homologous gene groups within a genome at various levels of stringency. For example with a stringency of 10, every gene in a group should have a % sequence coverage and % identity of at least 10% with at least one other gene in the group. The program should take a single genome file in genbank format on the command line and write a csv file called "output.csv" in the directory. The ouput file should consist of 10 lines, each line should have the stringency used (0 to 100, in increments of 10) and the number of homology groups. (See example below). write the python program as a txt file.

Example: 0,13456 10,234 20,234 30,200 40,190 50,187 60,187 70,100 80,95 90,55 100,45

please kindly help me. Thanks in advance.

python gene genbank • 6.6k views

ADD COMMENT • link updated 10.5 years ago by Whetting ★ 1.6k • written 10.5 years ago by armahmud3 • 0

0

Entering edit mode

Isn't this an exact duplicate of your last post?

ADD REPLY • link 10.5 years ago by Devon Ryan 104k

0

Entering edit mode

yes. sorry just edited the tag line of my post.

ADD REPLY • link 10.5 years ago by armahmud3 • 0

0

Entering edit mode

Oh wait, just read it again. Do you want the number of gene pairs which fulfill the criteria or the number of "single linkage clusters" according to the criteria? F.e.: if gene A and B are in a group and C and A also full criteria, they can be seen to form the set {A, B, C} or {A,B} {A,C}, do you want to count that as a 1 or 2 ( technically do you want all transitive extensions or not)?

ADD REPLY • link 10.5 years ago by Michael 54k

0

Entering edit mode

I will count that as 1.

ADD REPLY • link 10.5 years ago by armahmud3 • 0

0

Entering edit mode

Ok, treated in the pseudocode...

ADD REPLY • link 10.5 years ago by Michael 54k

0

Entering edit mode

could you provide some examples input file(s)

ADD REPLY • link 10.5 years ago by Whetting ★ 1.6k

score 4 · Answer 1 · 2013-10-13

This looks like a homework question. So you need to face our standard response: "What have you tried?", and otherwise motivation from others to write the complete script for you might be limited.

If you are an absolute beginner in programming, then this task might be a little over your head. I'd recommend to break down the task into smaller sub-tasks, which again I can give in pseudo code. Note that I do not know python, but this should help you to structure the task anyway. That is at least how I normally approach it (and that normally works for me quite well to structure my short scripts, and to spot the need for clarification in the definition ;) )

read genbank file 
declare count_vector := (0,...,0);
declare transitive_set_set_list := [{{}}]; # list of empty set of sets
   foreach pair of sequences {A,B}  in G (genbank file) do:
      run pairwise (reciprocal?) alignment  to get the alignment (I am not sure if that should be part of the task, but I guess so);           
      foreach set of filtering criteria i in 1:10 do:
         if alignment better than filtering criteria
             increment critera_count vector[i]
             transitive_set_set_list[i] := extend_transitive_sets({A,B}, transitive_set_set_list[i]); # this call handles the case of transitive extension
             # this is a bit more complicated
print count_vector
print transitive_set_set_list;

### note that this is almost python code thanks to pythons set datatype
subroutine extend_transitive_sets ({A,B}:set, S:set of sets ) returns set of sets
    locate sets Si1,2 in S: with A in Si1 and B in Si2                    
    Sreturn := {A,B} union Si1 union Si2 # extended sets, there should be max. 2 such nonempty sets, otherwise it's an error
    return (S setdiff {Si1,Si2} ) union Sreturn;

score 2 · Answer 2 · 2013-10-13

This uses a different approach but I think it should work...

1) Parse a GenBank file, and extract all the protein sequences. Save the sequences as a FastA file ([http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc12][1]).
2) Convert the fastA file to a BLAST database
3) Blast every protein in the fastA file against the custom database ([here][2])
4) Now parse the BLAST report (xml file), so that the match(es) fits your requirements (% sequence coverage and % identity of at least 10%)
5) format your output