Question

Having some regex problems capturing strings with special chars. Could use some help.

3.5 years ago

a.j.wilson0000 ▴ 10

I'm having a bit of trouble reformatting this messed-up run log. I want to remove the strings of characters that did not translate correctly from Linux terminal stdout into the log file and replace those strings with a \t, a \n, or whitespace. I'm doing this for a large number of files, so I need a command-line solution.

Log sample:

The following malformed strings repeat for every entry in the log:

^[[3J^[[H^[[2J^[[1;33m
^[[0m^[[0;33m
^[[0m^[[1;33m
^[[0m|^H/^H-^H^H
^[[1;37m
^[[0m^[[0;37m
^[[0m^[[1;37m
^[[0m^[[0;37m
^[[0m^[[0;37m^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^[[0m^[[0;37m
^[[0m^[[1;32m
^[[0m^[[0;32m

I've tried numerous GNU sed regexes to capture these with escaped special chars, but I keep getting 's/ ' unterminated errors (I think mainly due to the opening ^ in the strings?). Any pointers on how to do this with sed or awk? Is there an easier way, perhaps with some sort of find-and-replace Python/Perl script?

This is my current regex:

sed 's/\^\[\[3J\^\[\[H\^\[\[2J\^\[\[1;33m//g; s/\^\[\[0m\^\[\[0;33m//g; s/\^\[\[0m\^\[\[1;33m//g; s/\^\[\[0m|\^H\/\^H\-\^H\^H//g; s/\^\[\[1;37m//g; s/\^\[\[0m\^\[\[0;37m//g; s/\^\[\[0m//g; s/\^H//g; s/\^\[\[1;32m//g; s/\^\[\[0;32m//g' run.log > run_clean.log

sed awk regex • 1.7k views

ADD COMMENT • link updated 3.5 years ago by Jorge Amigo 14k • written 3.5 years ago by a.j.wilson0000 ▴ 10

I tried your command on a sample file and it worked for me.

Fatima-MacBook-Pro:~ Fatima$ cat tmp
^[[3J^[[H^[[2J^[[1;33m
^[[0m^[[0;33m
^[[0m^[[1;33m
^[[0m|^H/^H-^H^H
^[[1;37m
^[[0m^[[0;37m
^[[0m^[[1;37m
^[[0m^[[0;37m
^[[0m^[[0;37m^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^[[0m^[[0;37m
^[[0m^[[1;32m
^[[0m^[[0;32m

Fatima-MacBook-Pro:~ Fatima$ sed 's/\^\[\[3J\^\[\[H\^\[\[2J\^\[\[1;33m//g; s/\^\[\[0m\^\[\[0;33m//g; s/\^\[\[0m\^\[\[1;33m//g; s/\^\[\[0m|\^H\/\^H\-\^H\^H//g; s/\^\[\[1;37m//g; s/\^\[\[0m\^\[\[0;37m//g; s/\^\[\[0m//g; s/\^H//g; s/\^\[\[1;32m//g; s/\^\[\[0;32m//g' tmp

This link might help:

https://unix.stackexchange.com/questions/14684/removing-control-chars-including-console-codes-colours-from-script-output

ADD REPLY • link 3.5 years ago by Fatima ▴ 1000

Helpful to know it works for you and that my regex is at least correct. Something else is going wrong then, I suppose.

Based on your suggestion about color codes, I think the answer might be that sed is a stream editor and these are terminal ANSI codes. If you cat the log file, the progress bar representations and colors show up as shown below.

https://pasteboard.co/JvYUOyh.png

So sed can't recognize the codes because it is essentially reading the file like cat.
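
One way to check this hunch, offered as a sketch rather than a diagnosis of the actual log: if the file holds real control bytes, the ^[ seen in an editor is caret notation for the single ESC byte (0x1B) and ^H is backspace (0x08). A pattern that spells out a literal caret and bracket would then only match text that was copied and pasted, which may be why the same command worked on the pasted sample. The helper names make_sample and strip_literal below are made up for illustration.

```shell
# Hypothetical sample: one real ESC-based colour sequence around the word "ok".
make_sample() { printf '\033[1;33mok\033[0m\n'; }

# A literal-caret pattern in the style of the question. It looks for the two
# printable characters "^" and "[", so real ESC bytes are left untouched.
strip_literal() { sed 's/\^\[\[[0-9;]*m//g'; }

# cat -v renders the ESC byte in caret notation, as an editor would.
make_sample | cat -v                    # prints ^[[1;33mok^[[0m
make_sample | strip_literal | cat -v    # prints the same line, nothing removed
```

Running cat -v (or od -c) on the real run.log should show whether the sequences are control bytes or literal text.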

ADD REPLY • link 3.5 years ago by a.j.wilson0000 ▴ 10

Is this a bioinformatics question?

ADD REPLY • link 3.5 years ago by Joe 21k

More of a raw data skills question, sure. I'm working on a bioinformatics pipeline for mtDNA deletion calling using the eKLIPse deletion caller. So yes, it is related to bioinformatics in that I'm trying to clean up the eKLIPse logs.

ADD REPLY • link 3.5 years ago by a.j.wilson0000 ▴ 10

score 1 · Answer 1 · 2020-10-16

3.5 years ago

Jorge Amigo 14k

I can think of simplifying the regex a little bit using perl, in case it helps:

perl -pe 's/\^\[\[(2J|3J|H|[01](;3[237])?m)//g; s/\^H//g; s/\|\/-//' run.log > run_clean.log
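
If the log contains real control bytes rather than the literal text ^[ and ^H, a variant along these lines may be closer to what is needed. It is a sketch under that assumption: the function name strip_ansi is hypothetical, the bytes are generated with printf so the pattern does not depend on sed understanding \x1b, and it deletes the sequences rather than substituting \t or \n.

```shell
# Build the two control bytes portably instead of relying on \x1b support in sed.
esc=$(printf '\033')   # ESC, displayed as ^[ by cat -v, less and vim
bs=$(printf '\010')    # backspace, displayed as ^H

# Assumes the log holds real control bytes. Removes, in order:
#   1. CSI sequences: ESC [ digits/semicolons, then one letter (3J, H, 2J, 1;33m, 0m)
#   2. spinner frames: one of | / - followed by a backspace
#   3. any backspaces left over
strip_ansi() {
    sed -e "s/${esc}\[[0-9;]*[A-Za-z]//g" \
        -e "s#[|/-]${bs}##g" \
        -e "s/${bs}//g"
}

# Usage: strip_ansi < run.log > run_clean.log
```

For many files, the same function could be called in a loop, for example for f in *.log; do strip_ansi < "$f" > "${f%.log}_clean.log"; done, keeping in mind that a second run would also pick up the _clean.log files.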