filter out multiple rows based on values in columns - perl and R attempt
3
2
Entering edit mode
9.2 years ago
Illinu ▴ 110

Similar to this post, I want to filter out all the rows that contain zero value at all columns. I have a file with transcript counts for each sample+replicate and it turns out that some transcripts have 0 counts for all samples and replicates, and in other cases only one sample does not have zero counts but all the rest of the samples do, so what I want to do is to filter out:

  1. All transcripts where there is zero counts for all samples and replicates
  2. All transcripts where there is zero counts for all samples except one (e.g., A and B but not C, A and C but not B, B and C but not A)

For example, input:

       A_rep1     A_rep2     B_rep1     B_rep2     C_rep1     C_rep2
s1     0          6          5          3          0          9
s2     66         0          5          32         8          0
s3     0          0          0          0          0          0
s4     8          22         0          4          5          5

Output of task 1):

       A_rep1     A_rep2     B_rep1     B_rep2     C_rep1     C_rep2
s1     0          6          5          3          0          9
s2     66         0          5          32         8          0
s4     8          22         0          4          5          5

I've been trying in a number of ways to automate the process instead of doing it manually in Excel buy filtering. So my first attempt was in R. For the first task it works well but then when I need to parse the file to process the other tasks it doesn't work.

data=read.table('genes.counts.matrix', header=T)
set1 <- as.matrix(data[,-1])
row.names(set1)<- data[,1]
all <- apply(set1, 1, function(x) all(x[1:16]==0))
newdata <- set1[!all,]
write.table(newdata, "genes.counts.matrix.modified", sep="\t")

Also my problem here is that the output places the headers from column1 but the headers should go on top of the counts and not start at the transcript column. It looks like this

       A_rep1     A_rep2     B_rep1     B_rep2     C_rep1     C_rep2      
s1     0          6          5          3          0          9

Then I tried with a oneliner perl but it is not working

perl -a -nle 'print if "$F[1-16] != 0" ' genes.counts.matrix > genes.counts.matrix.modified

or

perl -a -nle 'print if "$F[1]:$F[16] != 0" ' genes.counts.matrix > genes.counts.matrix.modified

My idea is to filter out first when all rows are equal to zero, next when rows from 1:12 are equal to zero, next when rows 1:4 and 9:16 are equal to zero, next when rows 1:8 and 12:16 are equal to zero and finally when rows 5:16 are equal to zero

This was my attempt in R and it didn't work

> all <- apply(set1, 1, function(x) (all(x[1:16]==0) | all(x[4:16]==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[12:16]==0))))
> newdata <- set1[!all,]

Linu

perl R • 25k views
ADD COMMENT
3
Entering edit mode
9.2 years ago
komal.rathi ★ 4.1k

I am surprised the last line of code did not work for you, I have removed a pair of braces & followed what you want, apart from that I changed nothing and it is working on my sample data:

# My idea is to filter out first when all rows are equal to zero,
# next when rows from 1:12 are equal to zero,
# next when rows 1:4 and 9:16 are equal to zero,
# next when rows 1:8 and 12:16 are equal to zero and
# finally when rows 5:16 are equal to zero"

all <- apply(set1, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[12:16]==0)) | all(x[5:16]==0))
newdata <- set1[!all,]

As for the headers, write out the output like this, so that when you open the output in any other application, the headers are not messed up:

newdata = cbind(samplenames = rownames(newdata), newdata)
write.table(newdata,'results.txt',quote = F,sep='\t',row.names = F,col.names = T)
ADD COMMENT
0
Entering edit mode

Yes!! your bracket changes worked!! I still get the headers moved to the left but I am more than satisfied with having the function to work. Thanks a million!!

ADD REPLY
0
Entering edit mode

Illinu I have edited my answer to account for the problem with the header.

ADD REPLY
0
Entering edit mode

This is not working, I have two outcomes:

This one applies the functions but messes up the headers

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> all <- apply(data, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- data[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR", sep="\t")

This one returns the file with the header in place but it does not apply the function

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> set1 <- as.matrix(data[,-1])
> set1 = cbind(transcripts = rownames(set1), set1)
> all <- apply(set1, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- set1[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR",quote = F,sep='\t',row.names = F,col.names = T)
ADD REPLY
0
Entering edit mode

Umm, of course it will not apply the function! the order of r commands is incorrect. Apply the function first, then use cbind() and then write it! My bad, I used 'set1' as the name of file to write out instead of 'newdata'.

I have made that change. Just follow the order like I have shown.

ADD REPLY
0
Entering edit mode

Excellent, that worked!! thanks a million

ADD REPLY
1
Entering edit mode
9.2 years ago

A trick in R is to convert any dataframe to a "long" or tidy format:

> library(tidyr)
> library(dplyr)
> biost
  individual A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
1         s1      0      6      5      3      0      9
2         s2     66      0      5     32      8      0
3         s3      0      0      0      0      0      0
4         s4      8     22      0      4      5      5
> biost.tidy = biost %>% 
    gather(rep, value, -individual) %>% 
    separate(rep, into=c('sample', 'replica'), sep='_')

> biost.tidy
   individual sample replica value
1          s1      A    rep1     0
2          s2      A    rep1    66
3          s3      A    rep1     0
4          s4      A    rep1     8
5          s1      A    rep2     6
6          s2      A    rep2     0
7          s3      A    rep2     0
8          s4      A    rep2    22
9          s1      B    rep1     5
10         s2      B    rep1     5
11         s3      B    rep1     0
12         s4      B    rep1     0
13         s1      B    rep2     3
14         s2      B    rep2    32
15         s3      B    rep2     0
16         s4      B    rep2     4
17         s1      C    rep1     0
18         s2      C    rep1     8
19         s3      C    rep1     0
20         s4      C    rep1     5
21         s1      C    rep2     9
22         s2      C    rep2     0
23         s3      C    rep2     0
24         s4      C    rep2     5

Here biost.tidy is your data frame in a "tidy" format, in which each combination of individual, sample and replica is on a separate line. This format will make it much easier to plot the data frame with ggplot, or to calculate means by individual or sample. See http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html for more info.

For example you can calculate the number of not-null samples for each individual:

> biost.count = biost.tidy %>% 
    group_by(individual) %>% 
    summarise(
       tot_notnull = sum(value!=0)
    )

Source: local data frame [4 x 3]

  individual tot_notnull 
1         s1           4
2         s2           4
3         s3           0                                   
4         s4           5

This tells you that individual s3 has no value different from 0, while s1 and s2 have 4 not-null values each.

This other call will give you the number of samples in which the individual has at least one non-null score:

biost.tidy %>% group_by(individual, sample) %>% summarise(non_null=sum(value!=0)) %>% group_by(individual) %>% summarise(n=sum(non_null>0))

Source: local data frame [4 x 2]

  individual n
1         s1 3
2         s2 3
3         s3 0
4         s4 3

This tells you that for individuals s1,s2, and s3, all three samples have at least one replica > 0. Modify the last condition (sum(non_null)>1) to get the number of individuals were both replicas are non-null.

You can use either of these datasets to subset your initial data frame:

> biost.tidy %>% subset(individual %in% subset(biost.count, tot_notnull>2)$individual)
   individual sample replica value
1          s1      A    rep1     0
2          s2      A    rep1    66
4          s4      A    rep1     8
5          s1      A    rep2     6
6          s2      A    rep2     0
8          s4      A    rep2    22
9          s1      B    rep1     5
10         s2      B    rep1     5
12         s4      B    rep1     0
13         s1      B    rep2     3
14         s2      B    rep2    32
16         s4      B    rep2     4
17         s1      C    rep1     0
18         s2      C    rep1     8
20         s4      C    rep1     5
21         s1      C    rep2     9
22         s2      C    rep2     0
24         s4      C    rep2     5

Or, if you prefer, you can subset your initial data frame:

> biost %>% subset(individual %in% subset(biost.count, tot_notnull>2)$individual)
  individual A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
1         s1      0      6      5      3      0      9
2         s2     66      0      5     32      8      0
4         s4      8     22      0      4      5      5

For the problem of reading row names, try:

read.table('yourmatrix.csv', header=T, row.names=1)
ADD COMMENT
1
Entering edit mode

I think the OP is facing a problem with the headers not while reading the output in R, but opening the output in some kind of application like Excel.

ADD REPLY
0
Entering edit mode

This is a great piece of information for my R knowledge. Thank you for the time explaining everything.

The row.names did not fix the displacement of the headers. In summary doing the following gets the correct number of rows filtered as I wanted but I still see the headers starting from column 1 instead of 2.

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> all <- apply(data, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- data[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR", sep="\t")
ADD REPLY
0
Entering edit mode
9.2 years ago

Try this?

Filter 1:

data.filtered <- data[rowSums(data)!=0,]

Filter 2:

data.tmp <- data
data.tmp[data.tmp>0] <- 1
data.filtered2 <- data[rowSums(data.tmp)>1,]
ADD COMMENT
0
Entering edit mode

Filter 1 worked but I had to change it to:

data.filtered <- data[rowSums(set1)!=0,]

Filter 2 didn't work well for me. I followed what komal.rathi suggested and it worked.

But thanks a lot for your suggestion

ADD REPLY

Login before adding your answer.

Traffic: 1992 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6