Question

How to extract multiple bits of information that appear on different lines within the same text file

6.1 years ago

timothy.ghaly ▴ 10

I am trying to extract the sequence ID and cluster number that occur on different lines within the same text file.

The input looks like this:

    >Cluster 72
    0   319aa, >O311_01007... *
    >Cluster 73
    0   318aa, >1494_00753... *
    1   318aa, >1621_00002... at 99.69%
    2   318aa, >1622_00575... at 99.37%
    3   318aa, >1633_00422... at 99.37%
    4   318aa, >O136_00307... at 99.69%
    >Cluster 74
    0   318aa, >O139_01028... *
    1   318aa, >O142_00961... at 99.69%
    >Cluster 75
    0   318aa, >O300_00856... *

The desired output is the sequence ID in one column and the corresponding cluster number in the second.

    >O311_01007  72
    >1494_00753  73
    >1621_00002  73
    >1622_00575  73
    >1633_00422  73
    >O136_00307  73
    >O139_01028  74
    >O142_00961  74
    >O300_00856  75

Can anyone help with this?

Command line Coding Text processing • 1.1k views

Kloetzl, you're an absolute legend. That worked perfectly.

Thank you very much.

6.1 years ago by timothy.ghaly ▴ 10

score 4 · Accepted Answer · 2018-03-26

6.1 years ago

kloetzl ★ 1.1k

    awk '/^>/{n=$2}!/^>/{printf "%.11s %d\n", $3, n}' foo.txt
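
In the one-liner above, `%.11s` keeps the first 11 characters of the third field, which fits the sample because every ID there is `>` plus ten characters. If the IDs in a real file vary in length, one possible variant (a sketch, not the answerer's method) strips the trailing `...` instead of truncating to a fixed width. The file name `clusters_sample.txt` is a made-up placeholder, and the sample lines are copied from the question.

```shell
# Write a few lines from the question into a placeholder file (hypothetical name).
cat > clusters_sample.txt <<'EOF'
>Cluster 72
0   319aa, >O311_01007... *
>Cluster 73
0   318aa, >1494_00753... *
1   318aa, >1621_00002... at 99.69%
EOF

# Header lines (">Cluster N"): remember N and move on to the next line.
# Member lines: copy field 3, drop the trailing "...", print ID and cluster.
result=$(awk '/^>/ { n = $2; next }
              { id = $3; sub(/\.\.\.$/, "", id); print id, n }' clusters_sample.txt)

echo "$result"
```

For the sample this prints the same ID and cluster pairs as the fixed-width version. It assumes every member line carries the ID in the third whitespace-separated field and ends it with `...`, as in the input shown in the question.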
