Question

Get The Most Frequent Highest And Lowest Values

0

10.6 years ago

Mchimich ▴ 320

Hello everybody, I have a table (below) with chromosome, position, and occurrence columns, sorted by position.

chr01    8755    2
chr01    8848    15
chr01    8908    1
chr01    8912    2
chr01    8920    1
chr01    9198    19
chr01    9268    19
chr01    11299    1
chr01    11598    1
chr01    11605    1
chr01    11610    1
chr01    11656    1
chr01    11680    1
chr01    11692    1
chr01    11727    1
chr01    11750    1
chr01    11761    9
chr01    11776    4

I would like to automatically get the most frequent lowest and highest values: 8848 and 11761. Does anyone have an idea how I can do that using Perl or bash on a Linux platform?

Thanks in advance for your help !

perl statistics • 5.1k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 10.6 years ago by Mchimich ▴ 320

1

You should clarify what you mean by most frequent lowest and highest values. It looks like 1 should be the lowest occurrence value and 19 should be your highest?

ADD REPLY • link 10.6 years ago by Damian Kao 16k

1

I don't understand the rules for getting 8848 and 11761. Could you explain why it's not 8755 or 11776? If it's because you take occurrence into account, why isn't it 9198 and 9268? Thanks for the clarification.

ADD REPLY • link 10.6 years ago by Manu Prestat 4.1k

0

Sorry, maybe I was not very clear about that. I would like to select 8848 because, even though it's less frequent than 9198 and 9268, it's smaller than them and more frequent than 8755 (which is the minimum value). The same logic holds for 11761. Please let me know if it's still not clear. Thanks guys.

ADD REPLY • link 10.6 years ago by Mchimich ▴ 320

1

I still don't know what the criteria are for choosing these positions. Why is 8755 the minimum value? You mean minimum base position?

ADD REPLY • link 10.6 years ago by Damian Kao 16k

0

Thanks, Damian, for taking the time for this! The selection must satisfy two important criteria: the first is the position and the second is the occurrence of that position. In this case the minimum value (column 2) is 8755, but it's not the most frequent! So I would like to select 8848 because it's the smallest AND most frequent value in my data. The same applies to the biggest values. I hope that's clearer.

ADD REPLY • link 10.6 years ago by Mchimich ▴ 320

2

You should rethink these criteria. Your criteria of smallest by base position and most frequent by occurrence do not allow you to distinguish between 8848 and 9198 unless you weigh one criterion over the other somehow. And if you do choose to weigh position over occurrence, you need to justify why you are doing that.

ADD REPLY • link 10.6 years ago by Damian Kao 16k

1

Like Damian said, you should assign some weight. It seems your acceptable range for a high or low value is within 1000 bp; is that right? And like Manu said, why is it not 8755 or 11776? Why is it not 9198 and 9268? What is that black magic? BTW, do you use R? R may be a better fit for this kind of thing than scripting languages.

ADD REPLY • link 10.6 years ago by ananta.acharya ▴ 20

0

I have a CSV file containing three fields:

scaffold No., chromosome No., frequency of scaffold in chromosome (e.g. Os01)

I want to filter the scaffolds with high frequency in their respective chromosomes into a new filtered file.

Any suggestions?

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by aaarsh88 • 0

0

I think you should post it as a separate question & show some data for people to understand what you are asking for.

ADD REPLY • link 9.6 years ago by komal.rathi ★ 4.1k

score 1 · Answer 1 · 2013-09-19

1

10.6 years ago

Kenosis ★ 1.3k

Perhaps the following will be helpful:

use strict;
use warnings;

my ( %hash, $lowPos, $highPos );

# Read the table: column 2 is the position, column 3 its frequency.
while (<>) {
    my ( $position, $frequency ) = (split)[ 1, 2 ];
    $hash{$position} = $frequency;
}

# Walk positions in ascending order; stop at the first one whose
# frequency exceeds that of the smallest position.
for my $pos ( sort { $a <=> $b } keys %hash ) {
    if ( !defined $lowPos ) {
        $lowPos = $pos;
        next;
    }

    if ( $hash{$pos} > $hash{$lowPos} ) {
        $lowPos = $pos;
        last;
    }
}

# Same walk in descending order, starting from the largest position.
for my $pos ( sort { $b <=> $a } keys %hash ) {
    if ( !defined $highPos ) {
        $highPos = $pos;
        next;
    }

    if ( $hash{$pos} > $hash{$highPos} ) {
        $highPos = $pos;
        last;
    }
}

print "Lowest, most frequent position : $lowPos\n";
print "Highest, most frequent position: $highPos\n";

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

Lowest, most frequent position : 8848
Highest, most frequent position: 11761

This uses a hash to store position/frequency info as key/value pairs. To obtain the lowest, high-frequency position, it numerically sorts the keys in ascending order, then 'looks' for the first frequency value higher than the first position's value. It then does a similar procedure for the highest, high-frequency position.
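For readers who prefer the shell, the same procedure can be sketched with sort and awk. This is only an illustration under assumptions, not Kenosis's code: the function name first_higher is made up here, and the table is assumed to be whitespace-delimited with the position in column 2 and the occurrence in column 3.

```shell
# Sample table from the question (chromosome, position, occurrence).
data='chr01 8755 2
chr01 8848 15
chr01 8908 1
chr01 8912 2
chr01 8920 1
chr01 9198 19
chr01 9268 19
chr01 11299 1
chr01 11598 1
chr01 11605 1
chr01 11610 1
chr01 11656 1
chr01 11680 1
chr01 11692 1
chr01 11727 1
chr01 11750 1
chr01 11761 9
chr01 11776 4'

# Hypothetical helper: reads the table on stdin and prints the first position
# whose occurrence exceeds that of the extreme (first sorted) position.
# $1 is the sort flag for column 2: "n" (ascending) or "nr" (descending).
first_higher() {
    sort -k2,2"$1" | awk '
        NR == 1                 { pos = $2; freq = $3 + 0; next }  # extreme position
        !found && $3 + 0 > freq { pos = $2; found = 1 }            # first higher count
        END                     { if (NR > 0) print pos }
    '
}

printf '%s\n' "$data" | first_higher n    # prints 8848
printf '%s\n' "$data" | first_higher nr   # prints 11761
```

As with the Perl version, the result depends entirely on the occurrence of the extreme position: if no later row has a higher count, the extreme position itself is printed.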

Hope this helps!

ADD COMMENT • link 10.6 years ago by Kenosis ★ 1.3k

0

Hi Kenosis, and thanks for your reply. I tested your solution and it works just fine for the example that I provided. But unfortunately, when I tried it with a different example:

chr01    8947    15
chr01    8972    10
chr01    9075    1
chr01    9101    2
chr01    9172    12
chr01    9505    1
chr01    11554    1
chr01    11751    1
chr01    11947    1
chr01    12174    1
chr01    12350    1
chr01    12415    2
chr01    12504    1
chr01    12606    1
chr01    12792    1
chr01    12898    1
chr01    13081    1
chr01    13355    3
chr01    13381    3
chr01    13415    1
chr01    13444    1
chr01    13585    18
chr01    13610    1

The result is :

Lowest, most frequent position : 13585

Highest, most frequent position: 13585

instead of :

Lowest, most frequent position : 8947

Highest, most frequent position: 13585

Do you have any idea about that? Thanks again.

ADD REPLY • link 10.6 years ago by Mchimich ▴ 320

0

All that script is doing is sorting the rows by position and then finding the position that has more occurrences than the first position. You really need to re-think your criteria because it doesn't work for all cases.
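One way to make the criteria explicit, assuming (this is not something the original poster stated) that "frequent" means an occurrence at or above a chosen cutoff, is to keep only the rows that pass the cutoff and report the smallest and largest remaining positions. The function name extremes_above and the cutoff value 10 are hypothetical; the table is assumed to be whitespace-delimited.

```shell
# Second example table from the thread (chromosome, position, occurrence).
data='chr01 8947 15
chr01 8972 10
chr01 9075 1
chr01 9101 2
chr01 9172 12
chr01 9505 1
chr01 11554 1
chr01 11751 1
chr01 11947 1
chr01 12174 1
chr01 12350 1
chr01 12415 2
chr01 12504 1
chr01 12606 1
chr01 12792 1
chr01 12898 1
chr01 13081 1
chr01 13355 3
chr01 13381 3
chr01 13415 1
chr01 13444 1
chr01 13585 18
chr01 13610 1'

# Hypothetical helper: prints the lowest and highest positions whose
# occurrence is at least the cutoff passed as $1 (an assumed criterion).
extremes_above() {
    awk -v cutoff="$1" '
        $3 + 0 >= cutoff + 0 {
            if (low == ""  || $2 + 0 < low)  low  = $2 + 0
            if (high == "" || $2 + 0 > high) high = $2 + 0
        }
        END { if (low != "") print low, high }   # nothing printed if no row qualifies
    '
}

printf '%s\n' "$data" | extremes_above 10   # prints: 8947 13585
```

The cutoff is where position is weighed against occurrence, so it has to be chosen and justified for the data at hand; a different cutoff can give different positions.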

ADD REPLY • link 10.6 years ago by Damian Kao 16k

score 0 · Answer 2 · 2013-09-19

0

10.6 years ago

Ashutosh Pandey 12k

If you look at some basic Unix commands, you should be able to do it yourself. If your file is tab-delimited, then try:

cut -f2 yourfile.txt | sort -n              # prints the second column in increasing order

cut -f2 yourfile.txt | sort -n | head -1    # prints the smallest value

cut -f2 yourfile.txt | sort -n | tail -1    # prints the highest value

ADD COMMENT • link 10.6 years ago by Ashutosh Pandey 12k

0

Thanks for your reply. I gave one example here, but the line number of the most abundant sequence can change! Your solution works perfectly for that particular case.

ADD REPLY • link 10.6 years ago by Mchimich ▴ 320

0

Yup, I didn't read it carefully. But for your question to be answered, you first need to answer Manu's question. Thanks.

ADD REPLY • link 10.6 years ago by Ashutosh Pandey 12k

score 0 · Answer 3 · 2013-09-19

0

10.6 years ago

Manu Prestat 4.1k

To get min and max in a specific column using awk:

cat table.txt | awk 'min=="" || $2 < min {min=$2} END{ print min}'
cat table.txt | awk 'max=="" || $2 > max {max=$2} END{ print max}'
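The two passes can probably be folded into one. The following is a minimal sketch, assuming a whitespace- or tab-delimited table with the position in column 2; the function name col2_range is made up for illustration.

```shell
# Hypothetical helper: prints the minimum and maximum of column 2 in one pass.
col2_range() {
    awk '
        NR == 1      { min = max = $2 + 0 }   # seed both with the first row
        $2 + 0 < min { min = $2 + 0 }
        $2 + 0 > max { max = $2 + 0 }
        END          { if (NR > 0) print min, max }
    '
}

# Three rows taken from the table in the question.
printf 'chr01\t8755\t2\nchr01\t8848\t15\nchr01\t11776\t4\n' | col2_range   # prints: 8755 11776
```

Usage on a file: col2_range < table.txt. Adding 0 to $2 forces a numeric comparison, so the result does not depend on how the awk implementation treats field values.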