Biostar Beta. Not for public use.
filter out multiple rows based on values in columns - perl and R attempt
2
Entering edit mode
19 months ago
Illinu • 90
Belgium

Similar to this post I want to filter out all the rows that contain zero value at all columns. I have a file with transcript counts for each sample+replicate and it turns out that some transcripts have 0 counts for all samples and replicates, and in other cases only one sample does not have zero counts but all the rest of the samples do, so what I want to do is to filter out:

1) all transcripts where there is zero counts for all samples and replicates

2) all transcripts where there is zero counts for all samples except one (e.g., A and B but not C, A and C but not B, B and C but not A)

For example, input:

A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
s1 0 6 5 3 0 9
s2 66 0 5 32 8 0
s3 0 0 0 0 0 0
s4 8 22 0 4 5 5

A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
s1 0 6 5 3 0 9
s2 66 0 5 32 8 0
s4 8 22 0 4 5 5

I've been trying in a number of ways to automate the process instead of doing it manually in Excel buy filtering. So my first attempt was in R. For the first task it works well but then when I need to parse the file to process the other tasks it doesn't work.

set1 <- as.matrix(data[,-1])
row.names(set1)<- data[,1]
all <- apply(set1, 1, function(x) all(x[1:16]==0))
newdata <- set1[!all,]
write.table(newdata, "genes.counts.matrix.modified", sep="\t")

# also my problem here is that the output places the headers from column1 but the headers should go on top of the counts and not start at the transcript column. It looks like this

A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
s1 0 6 5 3 0 9

Then I tried with a oneliner perl but it is not working

perl -a -nle 'print if "$F[1-16] != 0" ' genes.counts.matrix > genes.counts.matrix.modified or perl -a -nle 'print if "$F[1]:$F[16] != 0" ' genes.counts.matrix > genes.counts.matrix.modified My idea is to filter out first when all rows are equal to zero, next when rows from 1:12 are equal to zero, next when rows 1:4 and 9:16 are equal to zero, next when rows 1:8 and 12:16 are equal to zero and finally when rows 5:16 are equal to zero This was my attempt in R and it didn't work all <- apply(set1, 1, function(x) (all(x[1:16]==0) | all(x[4:16]==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[12:16]==0)))) newdata <- set1[!all,] Linu ADD COMMENTlink 3 Entering edit mode 18 months ago komal.rathi ♦ 3.4k Children's Hospital of Philadelphia, Ph… I am surprised the last line of code did not work for you, I have removed a pair of braces & followed what you want, apart from that I changed nothing and it is working on my sample data: "# My idea is to filter out first when all rows are equal to zero, # next when rows from 1:12 are equal to zero, # next when rows 1:4 and 9:16 are equal to zero, # next when rows 1:8 and 12:16 are equal to zero and # finally when rows 5:16 are equal to zero" all <- apply(set1, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[12:16]==0)) | all(x[5:16]==0)) newdata <- set1[!all,]  As for the headers, write out the output like this, so that when you open the output in any other application, the headers are not messed up: newdata = cbind(samplenames = rownames(newdata), newdata) write.table(newdata,'results.txt',quote = F,sep='\t',row.names = F,col.names = T)  ADD COMMENTlink 0 Entering edit mode Yes!! your bracket changes worked!! I still get the headers moved to the left but I am more than satisfied with having the function to work. Thanks a million!! ADD REPLYlink 0 Entering edit mode https://www.biostars.org/u/14211/ I have edited my answer to account for the problem with the header. ADD REPLYlink 0 Entering edit mode This is not working, I have two outcomes: # This one applies the functions but messes up the headers > data=read.table('genes.counts.matrix', header=T, row.names=1) > all <- apply(data, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0)) > newdata <- data[!all,] > write.table(newdata, "genes.counts.matrix.modifiedWithR", sep="\t")  # This one returns the file with the header in place but it does not apply the function > data=read.table('genes.counts.matrix', header=T, row.names=1) > set1 <- as.matrix(data[,-1]) > set1 = cbind(transcripts = rownames(set1), set1) > all <- apply(set1, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0)) > newdata <- set1[!all,] > write.table(newdata, "genes.counts.matrix.modifiedWithR",quote = F,sep='\t',row.names = F,col.names = T)  ADD REPLYlink 0 Entering edit mode Umm, of course it will not apply the function! the order of r commands is incorrect. Apply the function first, then use cbind() and then write it! My bad, I used 'set1' as the name of file to write out instead of 'newdata'. I have made that change. Just follow the order like I have shown. ADD REPLYlink 0 Entering edit mode Excellent, that worked!! thanks a million ADD REPLYlink 1 Entering edit mode 17 months ago London, UK A trick in R is to convert any dataframe to a "long" or tidy format: > library(tidyr) > library(dplyr) > biost individual A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2 1 s1 0 6 5 3 0 9 2 s2 66 0 5 32 8 0 3 s3 0 0 0 0 0 0 4 s4 8 22 0 4 5 5 > biost.tidy = biost %>% gather(rep, value, -individual) %>% separate(rep, into=c('sample', 'replica'), sep='_') > biost.tidy individual sample replica value 1 s1 A rep1 0 2 s2 A rep1 66 3 s3 A rep1 0 4 s4 A rep1 8 5 s1 A rep2 6 6 s2 A rep2 0 7 s3 A rep2 0 8 s4 A rep2 22 9 s1 B rep1 5 10 s2 B rep1 5 11 s3 B rep1 0 12 s4 B rep1 0 13 s1 B rep2 3 14 s2 B rep2 32 15 s3 B rep2 0 16 s4 B rep2 4 17 s1 C rep1 0 18 s2 C rep1 8 19 s3 C rep1 0 20 s4 C rep1 5 21 s1 C rep2 9 22 s2 C rep2 0 23 s3 C rep2 0 24 s4 C rep2 5  Here biost.tidy is your data frame in a "tidy" format, in which each combination of individual, sample and replica is on a separate line. This format will make it much easier to plot the data frame with ggplot, or to calculate means by individual or sample. See http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html for more info. For example you can calculate the number of not-null samples for each individual: > biost.count = biost.tidy %>% group_by(individual) %>% summarise( tot_notnull = sum(value!=0) ) Source: local data frame [4 x 3] individual tot_notnull 1 s1 4 2 s2 4 3 s3 0 4 s4 5  This tells you that individual s3 has no value different from 0, while s1 and s2 have 4 not-null values each. This other call will give you the number of samples in which the individual has at least one non-null score: biost.tidy %>% group_by(individual, sample) %>% summarise(non_null=sum(value!=0)) %>% group_by(individual) %>% summarise(n=sum(non_null>0)) Source: local data frame [4 x 2] individual n 1 s1 3 2 s2 3 3 s3 0 4 s4 3  This tells you that for individuals s1,s2, and s3, all three samples have at least one replica > 0. Modify the last condition (sum(non_null)>1) to get the number of individuals were both replicas are non-null. You can use either of these datasets to subset your initial data frame: > biost.tidy %>% subset(individual %in% subset(biost.count, tot_notnull>2)$individual)
individual sample replica value
1          s1      A    rep1     0
2          s2      A    rep1    66
4          s4      A    rep1     8
5          s1      A    rep2     6
6          s2      A    rep2     0
8          s4      A    rep2    22
9          s1      B    rep1     5
10         s2      B    rep1     5
12         s4      B    rep1     0
13         s1      B    rep2     3
14         s2      B    rep2    32
16         s4      B    rep2     4
17         s1      C    rep1     0
18         s2      C    rep1     8
20         s4      C    rep1     5
21         s1      C    rep2     9
22         s2      C    rep2     0
24         s4      C    rep2     5


Or, if you prefer, you can subset your initial data frame:

> biost %>% subset(individual %in% subset(biost.count, tot_notnull>2)\$individual)
individual A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
1         s1      0      6      5      3      0      9
2         s2     66      0      5     32      8      0
4         s4      8     22      0      4      5      5


For the problem of reading row names, try:

read.table('yourmatrix.csv', header=T, row.names=1)

1
Entering edit mode

I think the OP is facing a problem with the headers not while reading the output in R, but opening the output in some kind of application like Excel.

0
Entering edit mode

This is a great piece of information for my R knowledge. Thank you for the time explaining everything.

The row.names did not fix the displacement of the headers. In summary doing the following gets the correct number of rows filtered as I wanted but I still see the headers starting from column 1 instead of 2.

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> all <- apply(data, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- data[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR", sep="\t")

0
Entering edit mode
14 months ago
Belgium, Brussels

try this ?

Filter 1:
data.filtered <- data[rowSums(data)!=0,]


Filter 2:

data.tmp <- data

data.tmp[data.tmp>0] <- 1

data.filtered2 <- data[rowSums(data.tmp)>1,]

0
Entering edit mode

Filter 1 worked but I had to change it to:

data.filtered <- data[rowSums(set1)!=0,]


Filter 2 didn't work well for me. I followed what komal.rathi suggested and it worked.

But thanks a lot for your suggestion

Similar Posts