Question

[SOLVED] Manipulating text in a Linux environment

0

10.0 years ago

theodore ▴ 90

Dear all.

I do not know whether a tool for this already exists, but I would like to perform the following steps on a tab-delimited file.

The file is as follows:

chr1 0 600 15 Repetitive/CNV 0 . 0 600 245,245,245
chr1 1000 1600 8 Insulator 0 . 1000 1600 10,190,254
chr1 100004000 100005200 2 Weak Promoter 0 . 100004000 100005200 255,105,105
chr1 100005200 100016800 13 Heterochrom/lo 0 . 100005200 100016800 245,245,245
chr1 10001600 10014800 13 Heterochrom/lo 0 . 10001600 10014800 245,245,245
chr1 100016800 100022800 12 Repressed 0 . 100016800 100022800 127,127,127
chr1 100022800 100026800 13 Heterochrom/lo 0 . 100022800 100026800 245,245,245
chr1 100026800 100028600 12 Repressed 0 . 100026800 100028600 127,127,127
chr1 100028600 100037000 13 Heterochrom/lo 0 . 100028600 100037000 245,245,245
chr1 100037000 100046600 12 Repressed 0 . 100037000 100046600 127,127,127
chr1 100046600 100046800 6 Weak Enhancer 0 . 100046600 100046800 255,252,4
chr1 100046800 100047000 2 Weak Promoter 0 . 100046800 100047000 255,105,105
chr1 100047000 100047200 4 Strong Enhancer 0 . 100047000 100047200 250,202,0
chr1 100047200 100047400 6 Weak Enhancer 0 . 100047200 100047400 255,252,4
chr1 100047400 100054200 13 Heterochrom/lo 0 . 100047400 100054200 245,245,245
chr1 100054200 100055000 12 Repressed 0 . 100054200 100055000 127,127,127
chr1 100055000 100087400 13 Heterochrom/lo 0 . 100055000 100087400 245,245,245
chr1 100087400 100087600 6 Weak Enhancer 0 . 100087400 100087600 255,252,4

First, I would like to remove the number and the space before the description of the region:

6 Weak Enhancer ---> Weak Enhancer

Second, I would like to count all "Weak Enhancer" entries (and every other set of identical values in column 4) and print something like the following:

Weak Enhancer 4
Heterochrom 20
. . .

I tried:

sort 'file.bed' | awk '{print $4}' | uniq -c -D -i

or

sort 'file.bed' | uniq -c -D -i

to no avail. Any help will be highly appreciated.

I should state that I want to do this as easily as possible. I have no real programming skills, and if OpenOffice can do it, I'm fine with that!

Thank you in advance

Theodore

awk sed Pipeline File-Manipulation • 2.8k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by theodore ▴ 90

1

10.0 years ago

chefer ▴ 350

Using your original file, you can do the unique count (col 5 in the example) like this:

cut -f 5 test.bed | sort | uniq -c

Which gives you this:

6 Heterochrom/lo
1 Insulator
1 Repetitive/CNV
4 Repressed
1 Strong Enhancer
3 Weak Enhancer
2 Weak Promoter

Edit: I misunderstood the example format; this will not work on the input dataset. You should cut on column 4 and then do the replacement as suggested by the upvoted post.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by chefer ▴ 350

1

It's understandably unclear from the question, but 2 Weak Promoter appears to be an example of a single column value, rather than being split into two columns. It appears you split things into different columns when you made the file on your local system.

ADD REPLY • link 10.0 years ago by Devon Ryan 104k

0

It is a tab-delimited file. The 4th column consists of (number)(space)(description), and the description can itself contain spaces.

I do not know how to make tabs visible when copying and pasting.

ADD REPLY • link 10.0 years ago by theodore ▴ 90

0

Some ways to paste or represent a tab in the terminal are discussed here.
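As a rough illustration (a sketch, and not necessarily what the linked discussion covers), printf can write real tab characters and sed's `l` command can make them visible. The file name demo.bed and the helper show_tabs are made up for this example.

```shell
# Write one tab-delimited line to a hypothetical file, demo.bed.
# printf turns each \t in the format string into a real tab character.
printf 'chr1\t0\t600\t15 Repetitive/CNV\n' > demo.bed

# show_tabs is a hypothetical helper name.
# sed -n l prints each line unambiguously: tabs are written as \t
# and the end of each line is marked with $.
show_tabs() {
    sed -n l "$1"
}

show_tabs demo.bed
```

In that output, the space inside column 4 stays a plain space while the column separators show up as \t, which makes it easy to tell the two apart. Very long lines may be folded by sed, so this is mostly useful for a quick look at the first few records.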

ADD REPLY • link 10.0 years ago by Neilfws 49k

0

I'll try it tomorrow. Thank you

ADD REPLY • link 10.0 years ago by theodore ▴ 90

0

10.0 years ago

theodore ▴ 90

cut -f 9 | sed 's/[0-9]* //' | sort | uniq -c | sed 's/* //' | awk '{print $2,$3"\t"$1}'

I got it thanks to your recommendations.

The above command/pipeline worked miracles for me.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by theodore ▴ 90

Ram · Accepted Answer · 2014-05-08

4

10.0 years ago

Devon Ryan 104k

cut -f 4 file.bed | sed 's/[0-9]* //' | sort | uniq -c

or something along those lines.

Edit: For the sake of clarity, the cut -f 4 portion extracts column 4 and the sed command just replaces a number followed by a space with nothing (i.e., it removes that pattern).
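To make the individual steps concrete, here is a minimal sketch of the same idea with one stage per line, plus a final awk step that prints the description first and the count second, as the question asked. The file sample.bed, its contents, and the function name count_states are invented for illustration. The sed pattern is also anchored with ^, a small variation on the command above, so that it only strips a number at the start of the field.

```shell
# Hypothetical sample data: column 4 holds "<number> <description>".
printf 'chr1\t0\t600\t15 Repetitive/CNV\t0\n'      >  sample.bed
printf 'chr1\t1000\t1600\t6 Weak Enhancer\t0\n'    >> sample.bed
printf 'chr1\t1600\t1800\t13 Heterochrom/lo\t0\n'  >> sample.bed
printf 'chr1\t1800\t2000\t6 Weak Enhancer\t0\n'    >> sample.bed

# count_states is a hypothetical helper name.
count_states() {
    cut -f 4 "$1" |           # keep only column 4 (cut splits on tabs by default)
        sed 's/^[0-9]* //' |  # drop the leading number and the space after it
        sort |                # put identical descriptions next to each other
        uniq -c |             # collapse each run into "<count> <description>"
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 "\t" n }'
                              # move the count to the end, separated by a tab
}

count_states sample.bed
```

The sort step matters because uniq only merges adjacent identical lines. The awk step is optional; without it the counts simply appear in front of each description.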

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Devon Ryan 104k

0

I've run the pipeline and it works great, although I get the following:

     45 10_Txn_Elongation
    133 11_Weak_Txn
     95 12_Repressed

It seems as if sed had replaced spaces (\s) with underscores (_)?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by theodore ▴ 90

0

It shouldn't do that, at least not unless you changed it to be something like sed 's/ /_/'.
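One possible explanation, which is only an assumption and is not confirmed anywhere in this thread, is that the file being processed already stores the labels with underscores (for example 10_Txn_Elongation). In that case the pattern 's/[0-9]* //' finds no number followed by a space and leaves the value untouched. A sketch of a variant that accepts either separator, using a made-up helper name strip_state_number:

```shell
# strip_state_number is a hypothetical helper name.
# It removes a leading run of digits followed by either a space or an underscore.
strip_state_number() {
    sed 's/^[0-9]*[ _]//'
}

# Two made-up values, one in each style.
printf '10_Txn_Elongation\n12 Repressed\n' | strip_state_number
```

Looking at the raw column first, for example with cut -f 4 file.bed | head, would show which separator the file actually uses.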

ADD REPLY • link 10.0 years ago by Devon Ryan 104k