Help with iTol annotation python script
1
0
Entering edit mode
5.0 years ago
heuristic • 0

Hello guys, I'm a beginner in bioinformatics and I have the following problem:

I have a input file like this:

Sequence_1,RT_RH_A17
Sequence_1,RH_A17
Sequence_1,INT_PHD
Sequence_1,PR_Ap2
Sequence_1,Gag
Sequence_1,RH_A17
Sequence_1,INT_rve
Sequence_2,INT_rve
Sequence_3,INT_rve
Sequence_4,INT_rve
Sequence_5,INT_rve
Sequence_6,RT_like
Sequence_6,RH_A17

It represents an annotation of transposon domain per sequence, as we can see, have several lines for the same sequence (e.g. sequence_1). I want to create an output file like this:

Sequence_1,1,1,1,1,1
Sequence_2,0,0,0,0,1
Sequence_3,0,0,0,0,1
Sequence_4,0,0,0,0,1 
Sequence_5,0,0,0,0,1
Sequence_6,0,0,1,1,0

The output represents a model to use in iTol (a tool to annotate phylogenetic trees). So I want the order of annotation as Gag,PR,RT,RH,INT, sequences that have one of those domains receive '1' in domain position or '0' if doesn't have.

I wrote a python script to try resolve this:

# open the table
domains=open('bel_domains.csv','r')
# create a output file
output = open('output.txt','w+')
# function to detect domains in lines
def found_domain(a,b):
    if a in b:
        output.write('1'+',')
    else:
        output.write('0'+',')

for line in domains:
    line_split = line.split(',')
    output.write(line_split[0]+',')
    found_domain('Gag',line_split[1])
    found_domain('PR',line_split[1])
    found_domain('RT',line_split[1])
    found_domain('RH',line_split[1])
    found_domain('INT',line_split[1])


output.close()

I know that I have a huge problem in logic and maybe a hugest problem in programming skills, because my output its:

Sequence_1,0,0,1,1,0,Sequence_1,0,0,0,1,0,Sequence_1,0,0,0,0,1,Sequence_1,0,1,0,0,0,Sequence_1,1,0,0,0,0,Sequence_1,0,0,0,1,0,Sequence_1,0,0,0,0,1...

Briefly, I need to detect wich lines are equal in the position of sequence name, and read&annotate domains in the following order (Gag,PR,RT,RH,INT), but I have no idea how I can implement this in my script.

Can someone help me?

Thanks!

python itol script • 1.4k views
ADD COMMENT
4
Entering edit mode
5.0 years ago
cschu181 ★ 2.8k
domain_order = ["Gag", "PR", "RT", "RH", "INT"]

domains = dict()
for line in open("bel_domains.csv"):
    seq, domain_info = line.strip().split(",")
    domains.setdefault(seq, set()).update(domain_info.split("_"))

with open("output.txt", "wt") as out:
    for seq in sorted(domains):
        contains_domains = [int(dom in domains[seq]) for dom in domain_order]
        print(seq, *contains_domains, sep=",", file=out)
ADD COMMENT
0
Entering edit mode

Thanks cschu181, it's work perfectly!

ADD REPLY

Login before adding your answer.

Traffic: 2732 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6