Perl: Shuffle An Array 10 Times, Computing 10 Average Maxes, Printing The Mean Max Average. Repeat This Entire Process 1000 Times.
11.3 years ago
Neal ▴ 60

Hello all,

This is my first post here, but I will try to explain the programming problem as best as I can.

I have a data set which looks like the following

NR_046018    DDX11L1    ,    0    0    1    1    1    1    1    1    1    1    0    0    0    0    1.44    2.72    3.84    4.92
NR_047520    LOC643837    ,    3    2.2    0.2    0    0    0.28    1    1    1    1    2.2    4.8    5    5.32    5    5    5    5    3
NM_001005484    OR4F5    ,    2    2    2    1.68    1    0.48    0    0.92    1    1.8    2    2    2    2.04    3.88    3
NR_028327    LOC100133331    ,    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

What is needed

  1. Shuffle the array 10 times. After _each_ shuffle, divide the array into 2 new arrays, say set1 and set2.
  2. From each new array, find the maximum of each row of numbers and average those maxima (the "average max").
  3. Get 10 maximum averages of each set1 and set2. Compute the average of the 10 maximum averages obtained for each set, let's call it 10avg1 and 10avg2.
  4. Get a list of 1000 10avg1 and 1000 10avg2.

Code

use warnings;
use List::Util 'shuffle';
use List::Util qw(max);

my $file = 'mergesmall.txt';

open my $fh,'<',$file or die "Unable to open file";
open OUT,">Shuffle.out" or die;

my @arr = <$fh>;

my $i=10;
while($i){
    my @arr1 = ();  # Initialize 1st set
    my @arr2 = ();  # Initialize 2nd set
    my ($total1, $num1, $total2, $num2) = (0, 0, 0, 0);  # Reset the running sums for each shuffle

    my @shuffled = shuffle(@arr);

    push @arr1,(@shuffled[0..1]); #Shift into 1st set
    push @arr2,(@shuffled[2..3]); #Shift into 2nd set



    foreach $_(@arr1){
        my @val1 = split;
        my $max1 = max(@val1[3..$#val1]);

         $total1 += $max1;
         $num1++;
    }

    my $average_max1 = $total1 /  $num1;
    #print "\n\n","Average max 1st set is : ",$average_max1;
    print OUT "Average max 1st set is : ",$average_max1;

         foreach $_(@arr2){
        my @val2 = split;
        my $max2 = max(@val2[3..$#val2]);

        print "\n\n";

         $total2 += $max2;
         $num2++;
    }

    my $average_max2 =  $total2 /  $num2;
    #print "\n\n","Average max 2nd set is : ",$average_max2;
    print OUT "\n","Average max 2nd set is : ",$average_max2,"\n\n";


    $i--;

}

The Problem

The code I have been able to write so far can get 10 maximum averages for each of set1 and set2. I am not able to figure out how to compute the average of these 10 maximum averages. If I can figure this out, I can easily add a for loop to run 1000 times and obtain 1000 10avgset1 and 1000 10avgset2.

Points to Note

  1. In the actual data set each row has at most 400 numbers; some rows have fewer, some have none at all, but never more than 400.
  2. The actual dataset has 41,382 rows. Set1 will comprise 23,558 rows and set2 will comprise 17,824 rows.
  3. File is a .txt file and all the numbers in each row are tab delimited.
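
Point 1 matters for the code: List::Util's max returns undef for an empty list, so a row with no numbers needs a guard before it enters an average. A minimal sketch of such a guard, assuming the layout of the sample rows (ID, gene name, a lone comma, then the values, all whitespace-separated). The helper name row_max and the empty row "NR_000000" are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: return the maximum value of one data row,
# or undef when the row has no numbers after the comma column.
sub row_max {
    my ($line) = @_;
    my @fields = split ' ', $line;   # split on any run of whitespace (tabs or spaces)
    return if @fields <= 3;          # only ID, name and comma: no values
    return max(@fields[3 .. $#fields]);
}

my @rows = (
    "NR_046018\tDDX11L1\t,\t0\t0\t1\t1.44\t2.72\t3.84\t4.92",
    "NR_000000\tEMPTY1\t,",          # made-up row with no values
);

for my $row (@rows) {
    my $max = row_max($row);
    print defined $max ? "max = $max\n" : "no values, row skipped\n";
}
```

Whether empty rows should be skipped or counted as 0 is a decision about the analysis, not about Perl; the sketch only makes the choice explicit.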

Could you please explain what the application to bioinformatics is? It looks like you are doing some resampling here? Did I get it right that you want to compute the maximum of the averages, not the maximum and the averages?


@Michael Hello Michael! Thank you for your comment. This data is a small part of the ChIP-Seq data for the K562 cell line which I've been given to analyze. Yes, we are doing some resampling here; we are actually trying to generate a control set. And thank you for asking for a clarification, I think I should reframe the question. I need to compute the average maximum over all rows. For example, I find the maximum value in NR_046018, which is 4.92 here, similarly for NR_047520 (5.32), and so on for all the rows (23,558 in set1 and 17,824 in set2). Once these maximum values are found, I need to find their average, the average maximum.
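
For the four sample rows in the question, the row maxima are 4.92, 5.32, 3.88 and 0, so the average maximum is (4.92 + 5.32 + 3.88 + 0) / 4 = 3.53. A minimal sketch of that calculation, assuming the rows have already been parsed into array references of numbers. The name average_max is hypothetical, and ignoring empty rows is an assumption.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helper: average of the per-row maxima.
# Each argument is an array reference holding one row's numbers.
# Rows without numbers are ignored (an assumption, not part of the question).
sub average_max {
    my @maxima = map { max(@$_) } grep { @$_ } @_;
    return unless @maxima;
    return sum(@maxima) / @maxima;
}

# The four sample rows from the question.
my @rows = (
    [qw(0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92)],
    [qw(3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3)],
    [qw(2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3)],
    [qw(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)],
);

printf "Average max: %.2f\n", average_max(@rows);   # prints "Average max: 3.53"
```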


And since we are trying to generate a control set, I have to shuffle the main data (the one which has 41,382 rows; this main dataset was generated by combining two pre-existing datasets 1 and 2). For each shuffle, we divide the newly shuffled array into 2 new arrays and compute the average maximum for each of those new sets. We shuffle 10 times, obtaining 10 average maximums for set 1 and similarly for set 2. (I have been able to do it this far.) From these 10 average maximums, I need to find the mean average. Then this process of 10 shufflings needs to be repeated 1000 times, so that I have 1000 mean averages. I hope I was able to explain myself a little better...

11.3 years ago
SES 8.6k

This question is a bit tricky for several reasons.

  1. The data is oddly formatted so you need to do more than just slurp the whole file into an array (this is almost always the wrong thing to do).
  2. There are an unequal number of values for each line. That means you will need to think about how to select/sort values equally per line/measurement.
  3. It is not clear to me why you are using this selection algorithm, so I won't attempt to code that part.

Concerning 1), you probably just want those values right? Slurping the whole file and shuffling is likely not doing what you want. Here is one way to get the values:

#!/usr/bin/env perl

use v5.10; # make sure we have at least Perl 5.10 so we can use the feature 'say'
use strict;
use warnings;
use Data::Dump;
use Text::CSV;
use List::Util qw(sum max shuffle);

my $csv = Text::CSV->new({sep_char => "    "});

my @cols;

while (<DATA>) {
    chomp;
    my ($ids, $values) = split(/\,    /, $_);
    if ($csv->parse($values)) {
        @cols = $csv->fields();
        say "Data:          ",join(" ",@cols);
        say "Shuffled data: ",join(" ",shuffle(@cols));
        # Here you can select/sort the shuffled data how you like
        say "Mean:          ",mean(@cols);
        say "Max:           ",max(@cols);
        say "";
    }
    else {
        my $err = $csv->error_input;
        say "Failed to parse line: $err";
    }
}

sub mean { return @_ ? sum(@_) / @_ : 0 }

__DATA__
NR_046018    DDX11L1    ,    0    0    1    1    1    1    1    1    1    1    0    0    0    0    1.44    2.72    3.84    4.92
NR_047520    LOC643837    ,    3    2.2    0.2    0    0    0.28    1    1    1    1    2.2    4.8    5    5.32    5    5    5    5    3
NM_001005484    OR4F5    ,    2    2    2    1.68    1    0.48    0    0.92    1    1.8    2    2    2    2.04    3.88    3
NR_028327    LOC100133331    ,    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

If we call this "biostars_59584.pl" then:

$ perl biostars_59584.pl
Data:          0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
Shuffled data: 1 2.72 1 1 0 1.44 4.92 1 0 1 0 1 1 0 0 3.84 0 1
Mean:          1.16222222222222
Max:           4.92

Data:          3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3
Shuffled data: 5 1 1 0.2 2.2 3 1 5.32 5 3 5 1 4.8 5 0 0.28 0 5 2.2
Mean:          2.63157894736842
Max:           5.32

Data:          2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3
Shuffled data: 2.04 3 0 1 2 2 0.92 2 1 2 0.48 2 2 3.88 1.8 1.68
Mean:          1.7375
Max:           3.88

Data:          0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Shuffled data: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mean:          0
Max:           0

Generating shuffled samples of those values is easy now if you edit a couple of lines:

while (<DATA>) {
    chomp;
    my ($ids, $values) = split(/\,    /, $_);
    if ($csv->parse($values)) {
        @cols = $csv->fields();
        say "Data:                ",join(" ",@cols);
        for (my $i = 0; $i < 10; $i++ ) { # Use a C-style for loop to generate 10 samples
            say "Shuffled data set $i: ",join(" ",shuffle(@cols));
            # Here you can select/sort the shuffled data how you like 
        }
        say "";
    }
    else {
        my $err = $csv->error_input;
        say "Failed to parse line: $err";
    }
}

Looking at the first set, you can see the shuffled data:

$ perl biostars_59584.pl | head -11
Data:                0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
Shuffled data set 0: 0 1 1 1 0 0 0 1 2.72 4.92 0 1 1 0 1 3.84 1.44 1
Shuffled data set 1: 1.44 1 4.92 1 1 1 1 0 1 2.72 0 1 3.84 1 0 0 0 0
Shuffled data set 2: 0 4.92 0 1 1 3.84 1 1 1 1 1 2.72 1 0 0 1.44 0 0
Shuffled data set 3: 1 1 0 0 1 3.84 1.44 1 1 2.72 0 1 1 4.92 0 0 1 0
Shuffled data set 4: 4.92 1 1 1.44 0 1 1 0 3.84 2.72 0 0 1 1 0 1 1 0
Shuffled data set 5: 1 0 1 1 4.92 0 0 0 0 1 1.44 1 0 1 3.84 2.72 1 1
Shuffled data set 6: 4.92 3.84 0 1 1 1 1 0 0 1.44 0 1 1 0 2.72 1 0 1
Shuffled data set 7: 1 1 0 0 1 1 1 0 2.72 0 1.44 1 1 1 4.92 3.84 0 0
Shuffled data set 8: 4.92 0 1.44 1 1 0 1 1 1 0 1 1 0 1 0 3.84 0 2.72
Shuffled data set 9: 1 4.92 1 1 0 1 0 1 0 2.72 0 1 1.44 0 1 1 3.84 0

From there, you just need to figure out the appropriate way to sample, then you can calculate the stats and store them (in a hash, for example). About 2) above, I'm not sure it makes sense to slice the first few elements off an array and "resample" those as you are trying to do. Note that there are many ways of sampling an array, so try to figure out the method you want, then search for a package on CPAN. For example, there are already efficient methods for resampling means, and there may also be more efficient ways of shuffling and selecting the elements from an array. If you are confident about the last point (3), then proceed (it should be pretty easy), or find a more appropriate algorithm and Perl package.
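
Storing the stats in a hash could look like the following minimal sketch, which keys each row's mean and max by its ID. The helper name add_row_stats and the nested hash layout are assumptions for illustration, not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum max);

# Hypothetical helper: record mean and max of one row under its ID.
# $stats is a hash reference: ID => { mean => ..., max => ... }.
sub add_row_stats {
    my ($stats, $id, @values) = @_;
    return unless @values;           # skip rows without numbers
    $stats->{$id} = {
        mean => sum(@values) / @values,
        max  => max(@values),
    };
    return;
}

my %stats;
add_row_stats(\%stats, 'NM_001005484',
    qw(2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3));

for my $id (sort keys %stats) {
    printf "%s mean=%.4f max=%s\n", $id, $stats{$id}{mean}, $stats{$id}{max};
}
```

For the row used here this reports the same mean (1.7375) and max (3.88) as the output shown earlier.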

EDIT: Added a line to test if the code will work with older Perl versions. Note that the last link about sorting by index instead of value is something to keep in mind but I don't think it will really make a difference here. I would just use List::Util::shuffle for simplicity.


@SES Hello SES! I'm sorry for replying late, as I was not in my lab the past 2 days. Many thanks for going through my question and attempting to resolve it. However, I think I need to explain myself a bit more. The individual values in the rows need not be shuffled. It is the rows themselves which need to be shuffled. Taking the example of the small dataset I've written, it has 4 rows (compared to 41,382 in the actual file). These 4 rows are shuffled once. This shuffled array is divided into 2 arrays, @arr1 and @arr2, each of which contains 2 rows here for simplicity. Now I find the maximum of row1 and row2 in @arr1. Then the average max is computed in $average_max1. The same thing is repeated for @arr2, to obtain $average_max2. These are the results we obtain after the 1st shuffle. Next, the main array @arr is shuffled again, and this time different rows are pushed into @arr1 and @arr2, thereby giving a different $average_max1 and $average_max2. The code I've written does a fine job till this point.


After this is where I am getting stuck. Ten shufflings of the main array @arr give 10 values each of $average_max1 and $average_max2. I need a way to find the average of the 10 $average_max1 values, let's call it $superaverage1; similarly, the average of the 10 $average_max2 values needs to be found, let's call it $superaverage2. Ultimately, I need to obtain 1000 values of $superaverage1 and 1000 values of $superaverage2. I hope I was able to explain myself a bit better now...
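
One way to get the superaverages is to push each shuffle's average max onto an array and take the mean of that array after the 10th shuffle, then wrap the whole thing in an outer loop of 1000 runs. The following is a minimal sketch under stated assumptions, not a tested solution for the real file. It assumes the row maxima are computed once up front (a row's maximum does not change when rows are reordered, so only the maxima need shuffling). The sub names shuffle_once and superaverages are hypothetical, and the set sizes of 2 and 2 are those of the small example; for the full data the first set size would be 23,558.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum shuffle);

sub mean { return @_ ? sum(@_) / @_ : 0 }

# One shuffle: reorder the row maxima, split them into two sets,
# and return the average max of each set.
sub shuffle_once {
    my ($maxima, $size1) = @_;
    my @shuffled = shuffle(@$maxima);
    my @set1 = @shuffled[0 .. $size1 - 1];
    my @set2 = @shuffled[$size1 .. $#shuffled];
    return (mean(@set1), mean(@set2));
}

# $n_shuffles shuffles: collect the average maxes, return their means
# (the "superaverage" of each set).
sub superaverages {
    my ($maxima, $size1, $n_shuffles) = @_;
    my (@avg1, @avg2);
    for (1 .. $n_shuffles) {
        my ($a1, $a2) = shuffle_once($maxima, $size1);
        push @avg1, $a1;
        push @avg2, $a2;
    }
    return (mean(@avg1), mean(@avg2));
}

# Row maxima of the four sample rows; for the real file these would be
# computed once while reading it.
my @maxima = (4.92, 5.32, 3.88, 0);

my (@super1, @super2);
for (1 .. 1000) {
    my ($s1, $s2) = superaverages(\@maxima, 2, 10);
    push @super1, $s1;
    push @super2, $s2;
}

printf "Collected %d and %d superaverages\n", scalar @super1, scalar @super2;
```

Shuffling the precomputed maxima instead of the raw lines avoids re-splitting 41,382 rows on each of the 10,000 shuffles. If the raw rows are needed for another reason, the same structure works with rows in place of maxima.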
