How to get only rows with values in at least 3 columns?
3
0
Entering edit mode
8.5 years ago
MatTesla • 0

Hi all,

I am sure it is a noob question but have some data from different replicates but in some cases I could measure values only in some cases.

How can I select only rows that have values different from zero in at least 3 columns (replicates)?

Unfortunately I am not sure how to do this yet.

An example of the data:

Cbx5                    0   19    0    11
Vcl                     0   19    0    0
U5-ENSMUSG00000064430   0   18    74   168
Tmbim6                  0   18    7    5
Prcp                    0   18    2    0
Ncl                     0   18    0    1
U4-ENSMUSG00000064606   0   18    0    0
AY172335.5              0   170   38   98
Mest                    0   17    5    10
Cdh1                    0   17    4    6
Spint2                  0   17    2    8
Uggt1                   0   17    2    0
Acly                    0   17    2    0

Thank you for your help.

gene ChIP-Seq • 9.4k views
ADD COMMENT
5
Entering edit mode
8.5 years ago
matted 7.8k

I'll give a Python answer (using Pandas):

import pandas
counts = pandas.read_csv("input.tsv", sep="\t", index_col=0)
counts = counts[(counts > 0).sum(axis=1) >= 3]
counts.to_csv("output.tsv", sep="\t", header=False)
ADD COMMENT
4
Entering edit mode
8.5 years ago
5heikki 11k

Assuming tab-separated values:

awk -F '\t' '{count=0} {for(i=2; i<6; i++) if($i==0) count++} {if(count<2) print $0}' file
ADD COMMENT
1
Entering edit mode

Nice go at a one-liner solution, but I think this will give a different answer, it will print if fewer than three of the replicates have a '0' value.

To find only rows with at least three replicates that are non-zero, this awk command should work:

awk 'BEGIN{FS="\t";OFS="\t"};{count=0;for(i=2; i<6; i++) {if($i!=0) {count++}};if (count>=3){print $0}}' boogens.txt
ADD REPLY
0
Entering edit mode

OP asked "values different from zero in at least 3 columns"

My solution counts zeros, if there are less than 2 (e.g. at least 3 fields have non-zero value), the line is printed.

Your solution counts non-zeros. If there are 3 or more, the line is printed.

I think with OP's data we arrive at the same exact output with both strategies (now that I edited the answer).

ADD REPLY
1
Entering edit mode

There are four data columns, so less-than-three could be two, which means that only two have non-zero values. I'm assuming (and you did too in your for loop) that OP didn't mean to count the label column as a non-zero.

ADD REPLY
0
Entering edit mode

Haha, I'm too tired. You're right. I fixed it :)

ADD REPLY
3
Entering edit mode
8.5 years ago

In R? Try:

dat<- data.frame(x= c(1, 0, 0, 0),
                 y= c(1, 0, 0, 1),
                 z= c(1, 0, 0, 1),
                 w= c(0, 1, 1, 1))
dat
  x y z w
1 1 1 1 0
2 0 0 0 1
3 0 0 0 1
4 0 1 1 1
cntNonZero<- apply(dat, 1, function(x) sum(x != 0))
dat[which(cntNonZero >=3),]
  x y z w
1 1 1 1 0
4 0 1 1 1
ADD COMMENT
4
Entering edit mode
dat[rowSums( dat != 0 ) >= 3,] #;)
ADD REPLY

Login before adding your answer.

Traffic: 2926 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6