Question

Get the top X number of lines per unique value in one column, once you've sorted a text file using 'sort'

3.3 years ago

niccolo.alfano • 0

Hi everybody,

I have a tab-separated text file with 19 columns, which I have sorted using a command such as:

sort -t$'\t' -k1,1 -k11,11g -k12,12gr -k3,3g file > file_sorted

Now I would like to keep the top X number of lines per unique value in column 1. I know that if I do:

sort -u -k1,1 --merge file_sorted > file_sorted_merged

I will keep only the 1st line for each unique value in column 1. How can I keep the top X (for example, the top 5) lines for the same value in column 1 from the sorted file?

Thanks a lot in advance

bash shell sort • 1.1k views

ADD COMMENT • link updated 3.3 years ago by GenoMax 141k • written 3.3 years ago by niccolo.alfano • 0

EDIT: You should edit your question and add how this is related to bioinformatics, or the post might be closed as off-topic.

Either switch to something that keeps more state in memory, like R or Python, or use sub-shells. The sub-shell would pick the unique values in the column, and then you can use awk to pick N matching lines for each value the sub-shell returns.

There would be an awful lot of trial and error and column-specific wrangling if you use awk, so I'd recommend using R.

ADD REPLY • link 3.3 years ago by Ram 43k

score 5 · Accepted Answer · 2021-01-02

3.3 years ago

shenwei356 8.4k

I have a tool, csvtk, whose uniq command does exactly what you want; check the last example.

csvtk uniq -t -f 1 -n 5

The logic behind it is simple: use a map/hash table (column value -> count) to track how many times you have seen a row with a certain value in the column you care about. If the count is <= N, print the line.
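For illustration, the same counting idea can be sketched with plain awk. This is only a minimal sketch, not csvtk's actual implementation: the function name top_n_per_key and the output file name are made up for this example, and it assumes tab-separated input that is already sorted so that the lines you want come first within each column-1 value.

```shell
# top_n_per_key N [FILE...]
# Keep at most N lines per distinct value of column 1 (hypothetical helper).
# Assumes tab-separated input, already sorted in the desired order.
top_n_per_key() {
    n="$1"
    shift
    # count[$1] is incremented for every line; the line is printed
    # only while the running count for its key is still <= n.
    awk -F'\t' -v n="$n" '++count[$1] <= n' "$@"
}

# Example: keep the top 5 lines per column-1 value of the sorted file
# top_n_per_key 5 file_sorted > file_sorted_top5
```

Because awk only keeps one counter per distinct key, memory use grows with the number of unique column-1 values rather than with the number of lines.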

ADD COMMENT • link 3.3 years ago by shenwei356 8.4k

Cool! It does exactly what I was looking for! Thanks a lot.

ADD REPLY • link 3.3 years ago by niccolo.alfano • 0

I've moved shenwei's comment to an answer. Please accept it so the post is marked as solved.

ADD REPLY • link 3.3 years ago by Ram 43k