Sum column until a certain value is reached then print that line
3
1
Entering edit mode
8.1 years ago

Hi All,

I have a bedgraph file with 4 columns - Chr, Start, End, Value.

I want to go through the 4th column and sum each row in that column until I have reached a certain value then print that row.

so for instance - Sum column until you a reached a value of 100 then print that entire line

I have done a little scripting but I just can't think how to do this. If someone could help me with an AWK command that would do this that would be great. If you ha the time to break down & explain the command a little that would be a bonus!

Thank you

bedgraph awk • 4.9k views
ADD COMMENT
4
Entering edit mode
8.1 years ago

You could do the following:

$ awk '\
    BEGIN { \
        s = 0; \
    } \
    { \
        s += $4; \
        if (s >= 100) { \
            print $0; \
            exit; \
        } \
    }' foo.bedgraph

Explanation: The BEGIN block initializes a summation variable called s, before any lines are processed. The main block processes each line, adding the value of the fourth column ($4) to the s variable. It then tests if that variable s is greater than or equal to 100. If not, the block processes the next line. If s is over the threshold, the current line is printed ($0) and awk exits early.

ADD COMMENT
1
Entering edit mode

I would strongly advise a new programmer against using numerical variables that will have their values changed during runtime, without explicit initialization.

ADD REPLY
0
Entering edit mode

I never understood the point of the BEGIN block. The above works just as well without it too..

ADD REPLY
0
Entering edit mode

If you're writing stuff to be reused, or at least self-documenting, initializing variables seems like a decent practice.

ADD REPLY
0
Entering edit mode

Inside some script/program.. sure. But as one-liner..

ADD REPLY
0
Entering edit mode

Perhaps your variable is already set where you need it. The "s" here may start at zero without initializing it. I often have to use BEGIN to set the field separators before reading anything.

ADD REPLY
2
Entering edit mode
8.1 years ago
venu 7.1k

I think it would be easy if you create a new column with sum result of each row and then take out the required row

awk '{s+=$4}{print $1"\t"$2"\t"$3"\t"$4"\t"s}' input.txt | awk '$5==100 {print $1"\t"$2"\t"$3"\t"$4}'
ADD COMMENT
0
Entering edit mode
8.1 years ago
John 13k

In python, because I think you can understand how it works immediately :)

with open('/path/to/some.file') as whatever:
    counter = 0
    for line in whatever:
        Chr, Start, End, Value = line.split('\t') # \t is for tab.

        counter += int(Value) # int() to turn the text of value into an actual number.
        if counter > 100:
            print line
            counter = 0
ADD COMMENT
1
Entering edit mode

I have very little python experience but I decided to give this approach a go.

I am getting numerous errors "ValueError: too many values to unpack" "ValueError: invalid literal for int() with base 10: 'Value\t\n'"

When i look at "whatever" , the "Chr, Start, End, Value" seems to be in as "nan"

Does the bedgraph file need to be in a certain format?

ADD REPLY
0
Entering edit mode

Hahaha, awesome :D This is such a nice example of why little scripts like the ones in this thread are best served with Python rather than awk. You said your data had 4 columns - actually your data has 5 columns, because your row ends with 'Value\t\n' - and none of the awk script would have noticed because they all say "take column X and..." rather than the python's much safer "take the 4 columns and...".

I'm sure one could write safe awk - its just that I very rarely see it here on Biostars because people often try to write the shortest, least-safe-as-possible awk scripts, to get the most upvotes, hehe.

So anyway, you got the first error of Value\t\n because there were 5 columns to unpack into 4 variable names. I guess you then hacked the code a bit to get it to work, and in the process almost did, except now the last column includes the \tab and \newline characters. The error about invalid literal for int() with base 10 is because we were expecting something in the 4th column that looks like a number, but instead got "Value\t\n". Why is Value\t\n there in the first place? Probably because you added a header manually, which also explains the trailing tab? So the solution now is to delete the tab at the end of your header line (or if its on all the other lines, tell python to expect 5 columns) and tell python that the first line has to be skipped, because it is a header. You'd do that by on the second line putting next(whatever), then the for line in whatever will start on line 2.

ADD REPLY
1
Entering edit mode

Yes - I did try hack it and I also changed the format of the file.bedgraph a little. Above is the code I used and the output from that code. and first 11 lines file.bedgraph is also on display

ADD REPLY
0
Entering edit mode

Ok, so that number in the fourth column - and this is really my fault because I should have known when you said bedgraph - is a float (decimal number), not an int(), so where it says "int(" in the code put "float(" so the math is right, otherwise you'll be rounding down to the nearest whole number.

Also, it looks like your delimiting on commas now, not tabs, so change the split('\t') to split(',') So the code should now look like:

with open('./file.bedgraph','r') as whatever:
    counter = 0
    next(whatever)
    for line in whatever:
        Chr, Start, End, Value = line.split(",")
        counter += float(Value)
        if counter > 127770:
            print line,
            counter = 0
        else: print counter

I added an else: print counter to the bottom which will print the current counter, just for this demo to show its working as you'd expect. It prints for your data:

-4.98
-5.11
0.46
10.88
16.45
27.73
44.71
59.27
63.27
72.12
ADD REPLY
0
Entering edit mode

Thank you - The script is working nicely!

One thing however, is that once it reaches the 127770 value it continues counting again and finds the next set of intervals that are separated by a value of 127770. e.g

eb2.3chr8,41827920,41827930,319.36 <- my line of interest
eb2.3chr8,41833860,41833870,376.01
eb2.3chr8,41838450,41838460,443.08
eb2.3chr8,41842060,41842070,465.21
eb2.3chr8,41845830,41845840,437.67
eb2.3chr8,41849320,41849330,481.47
eb2.3chr8,41852540,41852550,622.60
eb2.3chr8,41855230,41855240,374.73
eb2.3chr8,41858230,41858240,513.43
eb2.3chr8,41861700,41861710,707.50
eb2.3chr8,41864610,41864620,277.27
eb2.3chr8,41868180,41868190,350.90
eb2.3chr8,41871830,41871840,350.05
eb2.3chr8,41876280,41876290,289.40
eb2.3chr8,41881100,41881110,273.99
eb2.3chr8,41885450,41885460,253.01
eb2.3chr8,41890010,41890020,404.27
eb2.3chr8,41896450,41896460,172.96
eb2.3chr8,41901160,41901170,228.03
ADD REPLY
0
Entering edit mode

Oh, I thought that's what you wanted? :) You need to delete a line from the script for it to keep the counter going and not reset. I think if you can understand what each of those 9 lines of code are doing Joseph, you can figure out the answer to your last question on your own :) Chances are you already have after posting this since you almost got the data in the right format before ... :P

ADD REPLY
1
Entering edit mode

Thank you John - yeah what I done was replaced counter = 0 at the bottom of the script with break and that worked.

ADD REPLY
1
Entering edit mode

Safer Python would probably send the output of line.split() to an array or list to dereference by index, a problem that awk gets around by explicitly dereferencing by index. In addition, it is often a good idea to wrap Python calls that throw common errors/exceptions (such as the int() cast) in try..except blocks.

ADD REPLY
0
Entering edit mode

Agree, there's always safer code one should be using :) But errors aren't such a bad thing in Python, since the stack trace is often quite useful in figuring out the bug. Compared to a JS stack trace anyway, hahah. The really bad thing here was that int() cast would have gone totally undetected if he hadn't pasted his data. That's a bit naughty... for these sorts of posts I should throw in a check for that sort of thing.

ADD REPLY
0
Entering edit mode
with open('/file.bedgraph') as whatever:
counter = 0
next(whatever)
for line in whatever:
    Chr, Start, End, Value = line.split("\t") # \t is for tab.

    counter += int(Value) # int() to turn the text of value into an actual number.
    if counter > 127770:
        print line
        counter = 0

---OUTPUT---
'Chr,Start,End,Value\n'


Traceback (most recent call last):
  File "<pyshell#259>", line 5, in <module>
    Chr, Start, End, Value = line.split("\t") # \t is for tab.
ValueError: need more than 1 value to unpack
ADD REPLY
1
Entering edit mode
file.bedgraph 

Chr,Start,End,Value
eb2.3chr8,41806600,41806660,-4.98
eb2.3chr8,41806660,41806670,-0.13
eb2.3chr8,41806670,41806700,5.57
eb2.3chr8,41806700,41806720,10.42
eb2.3chr8,41806720,41806730,5.57
eb2.3chr8,41806730,41806750,11.28
eb2.3chr8,41806750,41806760,16.98
eb2.3chr8,41806760,41806770,14.56
eb2.3chr8,41806770,41806790,4.00
eb2.3chr8,41806790,41806830,8.85
ADD REPLY

Login before adding your answer.

Traffic: 2573 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6