CD-hit records matching and parsing
5.1 years ago
mnmalash • 0

I have a CD-hit result file (for those familiar with CD-hit). I want to take the second-column value from another file, a 2-column tab-delimited table, and append it in the CD-hit file next to each line with the matching RUN ID. The RUN ID is the 1st column of the tab-delimited table (below).

CD-hit result file (first file)

>Cluster 0
0   108nt, >ERR123456.1016542.1... *
1   108nt, >ERR123456.3114223.2... at +/93.52%
2   108nt, >ERR345678.217087.1... at -/89.81%
3   108nt, >ERR345678.291581.2... at -/92.59%
4   108nt, >ERR567890.3381351.2... at +/87.96%
5   108nt, >ERR987654.126640.2... at -/86.11%
6   108nt, >ERR987654.2492930.2... at +/84.26%
7   108nt, >ERR987654.3327702.1... at +/92.59%
>Cluster 1
0   108nt, >ERR876543.626414.2... *
1   108nt, >ERR123456.3213598.2... at +/85.19%
2   108nt, >ERR567890.1158706.2... at +/97.22%
3   108nt, >ERR345678.146372.1... at -/88.89%
4   108nt, >ERR765432.201531.2... at -/92.59%
5   108nt, >ERR765432.2770540.1... at -/87.04%

Tab-delimited table (second file)

 ERR123456   1650
 ERR345678   2350
 ERR567890   1520
 ERR876543   4520
 ERR987654   3960
 ERR765432   2550

I want the output file to contain the value from the 2nd column of the tab-delimited table appended to each line that contains its respective RUN ID (1st column in the table):

>Cluster 0
0   108nt, >ERR123456.1016542.1... *             1650  #matching RUN ID
1   108nt, >ERR123456.3114223.2... at +/93.52%   1650
2   108nt, >ERR345678.217087.1... at -/89.81%    2350
3   108nt, >ERR345678.291581.2... at -/92.59%    2350
4   108nt, >ERR567890.3381351.2... at +/87.96%   1520
5   108nt, >ERR987654.126640.2... at -/86.11%    3960
6   108nt, >ERR987654.2492930.2... at +/84.26%   3960
7   108nt, >ERR987654.3327702.1... at +/92.59%   3960
>Cluster 1
0   108nt, >ERR876543.626414.2... *              4520
1   108nt, >ERR123456.3213598.2... at +/85.19%   1650
2   108nt, >ERR567890.1158706.2... at +/97.22%   1520
3   108nt, >ERR345678.146372.1... at -/88.89%    2350
4   108nt, >ERR765432.201531.2... at -/92.59%    2550
5   108nt, >ERR765432.2770540.1... at -/87.04%   2550
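
The matching shown above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the RUN ID is taken to be the text between `>` and the first `.` on each member line, the table is split on any whitespace (tabs or spaces), and IDs missing from the table get a placeholder `NA`. The function names and the `NA` placeholder are illustrative, not part of CD-hit.

```python
import re

# Assumption: the RUN ID is the text between '>' and the first '.',
# e.g. ERR123456 in ">ERR123456.1016542.1..."
RUN_ID = re.compile(r">([^.\s]+)\.")


def load_table(lines):
    """Map RUN ID -> value from a 2-column tab/whitespace-delimited table."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table


def annotate(clstr_lines, table):
    """Yield CD-hit lines, appending the matching table value after a tab."""
    for line in clstr_lines:
        line = line.rstrip("\n")
        match = RUN_ID.search(line)
        if line.startswith(">Cluster") or match is None:
            # Cluster headers (and anything without a RUN ID) pass through unchanged.
            yield line
        else:
            # 'NA' is a hypothetical placeholder for IDs absent from the table.
            yield f"{line}\t{table.get(match.group(1), 'NA')}"


def annotate_file(clstr_path, table_path, out_path):
    """Read both input files and write the annotated CD-hit file."""
    with open(table_path) as handle:
        table = load_table(handle)
    with open(clstr_path) as src, open(out_path, "w") as dst:
        for line in annotate(src, table):
            dst.write(line + "\n")
```

Calling `annotate_file` with the paths of the CD-hit file, the table, and the desired output file would write the annotated result. The value is separated by a single tab rather than aligned in a column, which keeps the output easy to parse later.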

I would also be thankful if someone could tell me how, after this matching, to extract each cluster into a separate file named after the cluster.
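
For the per-cluster split, one possible approach is sketched below. It assumes every cluster starts with a `>Cluster N` header line, and it makes some illustrative choices that are not from CD-hit itself: files are named `Cluster_N.txt`, the header line is left out of each file, and the function names are hypothetical.

```python
from pathlib import Path


def split_clusters(lines):
    """Group member lines by cluster: {"Cluster_0": [lines], ...}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # ">Cluster 0" -> "Cluster_0", used later as the file name.
            current = line[1:].strip().replace(" ", "_")
            clusters[current] = []
        elif current is not None and line.strip():
            clusters[current].append(line)
    return clusters


def write_clusters(clusters, out_dir):
    """Write each cluster's member lines to <out_dir>/<cluster name>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, members in clusters.items():
        (out_dir / f"{name}.txt").write_text("\n".join(members) + "\n")
```

Passing an open file handle of the CD-hit file to `split_clusters` and the result to `write_clusters` would produce one file per cluster in the chosen directory.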

bash python cd-hit text-processing • 1.7k views