How to extract multiple bits of information that appear on different lines within the same text file
6.1 years ago

I am trying to extract the sequence ID and cluster number that occur on different lines within the same text file.

The input looks like this:

    >Cluster 72
    0   319aa, >O311_01007... *
    >Cluster 73
    0   318aa, >1494_00753... *
    1   318aa, >1621_00002... at 99.69%
    2   318aa, >1622_00575... at 99.37%
    3   318aa, >1633_00422... at 99.37%
    4   318aa, >O136_00307... at 99.69%
    >Cluster 74
    0   318aa, >O139_01028... *
    1   318aa, >O142_00961... at 99.69%
    >Cluster 75
    0   318aa, >O300_00856... *

The desired output is the sequence ID in one column and the corresponding cluster number in the second.

    >O311_01007  72
    >1494_00753  73
    >1621_00002  73
    >1622_00575  73
    >1633_00422  73
    >O136_00307  73
    >O139_01028  74
    >O142_00961  74
    >O300_00856  75

Can anyone help with this?

Command line Coding Text processing • 1.1k views

Kloetzl, you're an absolute legend. That worked perfectly.

Thank you very much.

6.1 years ago
kloetzl ★ 1.1k
    awk '/^>/{n=$2; next} {printf "%.11s %d\n", $3, n}' foo.txt
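
A note for anyone adapting the one-liner above: `%.11s` keeps the first 11 characters of the third field, so it relies on every ID being `>` plus exactly 10 characters, as in the sample. If your IDs vary in length, a variant that strips the trailing `...` instead may be safer. The following is only a sketch under that assumption; the temporary input file and the variable names (`input`, `result`, `cluster`, `id`) are made up for illustration, and the sample data is a shortened copy of the input from the question.

```shell
# Hypothetical sample file holding a shortened copy of the question's input
input=$(mktemp)
cat > "$input" <<'EOF'
>Cluster 72
0   319aa, >O311_01007... *
>Cluster 73
0   318aa, >1494_00753... *
1   318aa, >1621_00002... at 99.69%
>Cluster 74
0   318aa, >O139_01028... *
EOF

# Header lines set the current cluster number; every other line is a member
# whose third field is the ID followed by "...".
result=$(awk '
    /^>Cluster/ { cluster = $2; next }   # header line: remember the cluster number
    {
        id = $3
        sub(/\.\.\.$/, "", id)           # strip the trailing dots, whatever the ID length
        print id, cluster
    }
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

On the shortened sample this prints `>O311_01007 72`, `>1494_00753 73`, `>1621_00002 73` and `>O139_01028 74`, one pair per line. The columns are separated by a single space; set `OFS` (for example `awk -v OFS='\t' ...`) if you want tabs instead.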