If a string appears twice in column 1, select the higher value in column 2
5.8 years ago
drkennetz ▴ 560

Hi all,

I subsampled Illumina fastqs twice using 2 different random seeds. I then mapped all of the reads and got a total number of mapped reads and a mapped percent.

For each sample, I want to keep the run with the higher map%, because I have found that for the vast majority of my samples seqtk underestimates the actual map%.

The format of my tsv file is as follows:

sample1      200000      120000      60%
sample1      200000      115000      57.5%
sample2      200000      180000      90%
sample2      200000      190000      95%
...
sampleX      200000      180000      90%
sampleX      200000      182000      91%

For each sample name in column 1, I want to keep the line whose value in column 3 or 4 (it doesn't matter which) is higher. The expected output for the example above would be:

sample1      200000      120000      60%
sample2      200000      190000      95%
...
sampleX      200000      182000      91%

Looking forward to hearing your thoughts! Thanks

awk bash • 1.3k views
5.8 years ago
ATpoint 81k

Removing the % from $4, then sorting by $4 in descending numeric order and keeping only the first line for each unique ID in $1 gives the intended output. The advantage of this over an if/else script that iterates over the file and checks the following lines for other occurrences of the same $1 is that the input does not need to be sorted in any way:

awk '{gsub("%",""); print}' test.txt | sort -k4,4rn | sort -s -k1,1 -u | awk 'BEGIN{OFS="\t"} {$4=$4"%"; print}'
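
For comparison, the same selection can also be done in a single awk pass without sorting. This is only a sketch, assuming whitespace-separated input with the percentage in column 4; the function name pick_max_pct is made up for illustration. It differs from the pipeline above in that lines are printed unchanged, IDs come out in the order they first appear rather than sorted, and on a tie the first line seen is kept.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each ID in column 1, keep the line whose
# column 4 (a percentage such as "57.5%") is numerically largest.
pick_max_pct() {
    awk '
    {
        pct = $4
        sub(/%$/, "", pct)      # strip the trailing % for comparison only
        pct += 0                # force a numeric (not string) comparison
        if (!($1 in best)) {
            order[++n] = $1     # remember the first-seen order of IDs
            best[$1] = pct
            line[$1] = $0
        } else if (pct > best[$1]) {
            best[$1] = pct      # strictly greater: ties keep the first line
            line[$1] = $0
        }
    }
    END {
        for (i = 1; i <= n; i++) print line[order[i]]
    }' "$@"
}

# Example call (file name taken from the answer above):
# pick_max_pct test.txt
```

Because the best line per ID is held in memory, the input does not need to be sorted here either.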
worked like a charm! Thanks for the help.

You're very welcome!

5.8 years ago

With datamash. I added an extra, out-of-order row to the input to check that sorting is handled.

$ sed 's/%//g' test.txt | datamash  -sfg 1  max 4| cut --complement -f5 | sed 's/$/%/g'
sample1 200000  120000  90%
sample2 200000  190000  95%

$ cat test.txt 
sample1 200000  120000  60%
sample1 200000  115000  57.5%
sample2 200000  180000  90%
sample2 200000  190000  95%
sample1 200000  120000  90%