I need to create a gene2accession file where one gene ID may have one or more UniProt entries. My file looks like this:
comp100002_c0 Q9FFI3
comp100004_c0 B9DHK3
comp100004_c0 F4J3J5
comp100005_c0 P54150
and I need it to look like this:
comp100002_c0 Q9FFI3
comp100004_c0 B9DHK3|F4J3J5
comp100005_c0 P54150
I tried a Python script, but it doesn't work, and after tweaking the code for a while I am quite stuck. Does anyone have code that produces this output? Thanks.
My attempt in Python:
f1 = open(sys.argv[1], 'rU')
lines = f1.readlines()
for i in range(0, len(lines)):
    line = lines[i]
    next_l = f1.next()
    splitline = line.split('\t')
    splitnext = next_l.split('\t')
    if splitline[0] == splitnext[0]:
        print splitline[0] + '\t' + splitline[1] + '|' + splitnext[1]
    else:
        print line
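For reference, one possible way to get this output in Python 3 is to collect the accessions per gene ID in a dictionary and join them at the end, rather than comparing each line with the next one. This is only a minimal sketch: the function name merge_accessions is made up for illustration, and it assumes a two-column input separated by tabs or spaces, with the output written as gene ID, a tab, then the accessions.

```python
import sys


def merge_accessions(lines):
    """Group accessions by gene ID, keeping genes in first-seen order."""
    grouped = {}  # gene ID -> list of accessions (dicts keep insertion order in Python 3.7+)
    for line in lines:
        fields = line.split()  # splits on tabs or spaces; assumes two columns
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene, accession = fields[0], fields[1]
        grouped.setdefault(gene, []).append(accession)
    # Assumption: output is tab-separated, accessions joined with '|'
    return [gene + "\t" + "|".join(accs) for gene, accs in grouped.items()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python merge_accessions.py input.txt > gene2accession.txt
    with open(sys.argv[1]) as handle:
        for row in merge_accessions(handle):
            print(row)
```

Because the accessions are combined with "|".join, no trailing '|' is produced, and the rows for a gene do not need to be adjacent in the input.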
Thanks a million @EagleEye. I asked on Stack Overflow and my question was blocked because it was 'unclear' what I was asking for. Note that to get rid of the last occurrence of '|' I used sed 's/\(.*\)|/\1/'. Thanks again.