Question

Off topic: editing bed file with python

4.9 years ago

flogin ▴ 280

I have a file like this:

seq1,4205,6421
seq1,4205,6421
seq1,6367,7962
seq1,6367,7962
seq1,8527,9390
seq2,1612,4917
seq2,1612,4917
seq2,1612,4917
seq3,5813,6610
seq3,6676,8307
seq3,6676,8307

I want to remove the redundant sequence names and report, for each sequence, only the lowest start and the highest end value, like this:

seq1,4205,9390
seq2,1612,4917
seq3,5813,8307

I wrote a Python script to try to do this, using dictionaries (to convert to a DataFrame and write a CSV at the end).

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse as ag
parser = ag.ArgumentParser("This program receives as input a bed file with redundancy in sequence names, and different positions of each domain in the same sequence, and return a bed file without redundancy, and considered the lower and greater region as start and end respectively")
parser.add_argument("--infile",type=ag.FileType('r', encoding='UTF-8'),required=True,help="Input File")
args = parser.parse_args()
input_file = args.infile
dicti = {}
list_dicti = []
aux = "" # a quick fix to compare ID of each line
aux_2 = "" # a quick fix to compare ID of each line
start_1 = "" # used to compare values of start and end
start_2 = ""
end_1 = ""
end_2 = ""
for line in input_file: 
    if aux_2 == "": # in the first time, the aux_2 will receive the ID name
        aux_2 = line.strip().split(",")[0]  
    else:
        aux_2 = line.strip().split(",")[0]
    aux = line.strip().split(",")[0] # aux also receive the name of the sequence
    if aux == aux_2: #
        if start_1 == "" and end_1 == "":
            start_1 = line.strip().split(",")[1] 
            end_1 = line.strip().split(",")[2] 
        if start_2 == "" and end_2 == "":
            start_2 = line.strip().split(",")[1] 
            end_2 = line.strip().split(",")[2] 
        else:
            if start_2 < start_1:
                start_1 = start_2
            if end_2 > end_1:
                end_1 = end_2
    start_2 = line.strip().split(",")[1]
    end_2 = line.strip().split(",")[2] 
    if [d[aux] for d in list_dicti]:
        pass
    else:   
        dicti[aux]=start_1,end_1
        list_dicti.append(dicti)
    dicti = {}
    start_1 = ""
    start_2 = ""
    end_1 = ""
    end_2 = ""

for i in list_dicti:
    print(i)

But my output is:

Traceback (most recent call last):
  File "bel_domains.py", line 38, in <module>
    if [d[aux] for d in list_dictio]:
  File "bel_domains.py", line 38, in <listcomp>
    if [d[aux] for d in list_dictio]:
KeyError: 'seq1'

My logic was as follows: read each line of the file; if the sequence name is the same between lines, the sequence ID, the lowest start value and the highest end value should be inserted into a dictionary, and the dictionary should be appended to a list. To avoid inserting several entries for the same sequence, I added the lines

if [d[aux] for d in list_dicti]:
    pass

But that is exactly where the error occurs.
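
The failing check can be reproduced in isolation. In the sketch below, the sample data and the helper names `lookup_raises` and `seen` are illustrative assumptions, not part of the script above. It shows that `d[aux]` is an item lookup, so it raises `KeyError` as soon as any dictionary in the list lacks that key instead of producing an empty list, whereas a membership test with `in` simply returns a boolean.

```python
# Hypothetical sample: one single-key dictionary per sequence, as in the script above.
list_dicti = [{"seq1": ("4205", "6421")}]


def lookup_raises(key, dicts):
    """Return True if the original-style check raises KeyError for this key."""
    try:
        [d[key] for d in dicts]  # item lookup: fails on any dict lacking the key
    except KeyError:
        return True
    return False


def seen(key, dicts):
    """Membership test: True if any dictionary in the list already holds the key."""
    return any(key in d for d in dicts)


print(lookup_raises("seq1", list_dicti))  # key present in every dict
print(lookup_raises("seq2", list_dicti))  # key missing from a dict
print(seen("seq1", list_dicti))
print(seen("seq2", list_dicti))
```

Replacing the list comprehension with a test such as `any(aux in d for d in list_dicti)` would avoid the exception, although a single dictionary keyed by sequence name is probably simpler than a list of one-key dictionaries.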

Can anyone explain why this happens?
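
For the overall goal, here is a minimal sketch, assuming the input is comma-separated with integer coordinates in the second and third columns. The function name `merge_regions` and the in-memory sample are hypothetical. It keeps one dictionary keyed by sequence name and converts the coordinates to `int` before comparing, because strings compare character by character (for example, `"900" < "10000"` is `False`).

```python
import csv
import io


def merge_regions(lines):
    """Collapse rows of 'name,start,end' into one (lowest start, highest end) per name."""
    regions = {}  # name -> (lowest start, highest end); dicts preserve insertion order
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        name, start, end = row[0], int(row[1]), int(row[2])
        if name in regions:
            low, high = regions[name]
            regions[name] = (min(low, start), max(high, end))
        else:
            regions[name] = (start, end)
    return regions


# Hypothetical in-memory sample taken from the rows shown in the question.
sample = io.StringIO(
    "seq1,4205,6421\nseq1,6367,7962\nseq1,8527,9390\n"
    "seq2,1612,4917\nseq3,5813,6610\nseq3,6676,8307\n"
)
for name, (start, end) in merge_regions(sample).items():
    print(f"{name},{start},{end}")
```

Because an open file object iterates line by line, the file handle produced by the `--infile` argument could presumably be passed to `merge_regions` directly in place of the sample.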

Best,

python bed redundancy • 799 views

updated 4.9 years ago by Zoomboi ▴ 40 • written 4.9 years ago by flogin ▴ 280