extract columns from file based on header selected from 2nd file
3
1
Entering edit mode
8.4 years ago
joaofadista ▴ 50

I have one large text table, tab separated. The first row is a header. I then have a second text file that contains a subset of the headers in the first file. I want to extract all the columns of the first file whose header is contained in the list given in the second file. Here an example of the inputs and the desired output:

DATA.TXT

head0 head1 head2 head3 head4
1 25 1364 22 13
2 10 215 1 22

LIST.TXT

head0  
head4

Desired output:

head0 head4  
1 13  
2 22
text delimited • 12k views
ADD COMMENT
0
Entering edit mode

Can we have a few actual examples of the header names? It's easy if they do actually end 0, 1, 2, 3 ... but I assume they don't?

ADD REPLY
0
Entering edit mode

How can i do it in R? Thanks

ADD REPLY
0
Entering edit mode

How can i do it in R? Thanks

ADD REPLY
4
Entering edit mode
8.4 years ago
iraun 6.2k

Just with a simple google search a hundred of questions/answers related to this topic can be easily found, specially, in stackoverflow: http://unix.stackexchange.com/questions/114829/display-column-that-has-a-specific-string

I'd suggest you to try to solve this kind of problems searching on internet and trying yourself before asking. In this case, you can get what you want with a little modification to the solution proposed on the link:

awk -F'\t' 'NR==FNR{arr[$1]++;next}{for(i=1; i<=NF; i++) if ($i in arr){a[i]++;}} { for (i in a) printf "%s\t", $i; printf "\n"}' file2 file1
ADD COMMENT
0
Entering edit mode

This works perfectly. However, if LIST.TXT contains headers which do not exist in DATA.TXT, say headX and headY, and their order is "head0, headX, head4, headY", can the output include null values for headX and headY columns and also print output in the same order as in LIST.TXT file?

ADD REPLY
1
Entering edit mode
8.4 years ago
george.ry ★ 1.2k

Ok, regardless of my comment above ... two bash commands should see you done ...

paste <(seq 1 $(head -1 DATA.TXT | wc -w))  <(head -1 DATA.TXT | tr '\t' '\n') > TMP
cat DATA.TXT | cut -f$(grep -wFf LIST.TXT TMP | cut -f1 | sed ':a;N;$!ba;s/\n/,/g') > OUTPUT

It's the sort of thing that's easier in R or something, though!

// EDIT // On a Mac (useless BSD rather than GNU tools!) apparently you'd have to alter the sed command there to ...

sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g'
ADD COMMENT
0
Entering edit mode

Thanks George, but it outputs this error message:

cut: option requires an argument -- 'f'
ADD REPLY
0
Entering edit mode
8.4 years ago

if you know the columns in advance, and depending upon the delimitator, you can simply use a cut command. If tabs are used, in this case you select the first and fifth column

cat DATA.TXT | cut -f 1,5 > output.file

This requires you to reveal the desired columns. If this is easy to check, see how easy the code is

If the delimiting is different to a tab, such as an empty space

cat DATA.TXT | cut -d " " -f 1,5 > output.file

I dare to put this here because of the simplicity of this code

ADD COMMENT

Login before adding your answer.

Traffic: 2458 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6