Question

How to re-order (sort) to get top values for each column in excel sheet of OTUs?

0

Entering edit mode

5.8 years ago

arsilan324 ▴ 90

Hi all,

I have OTU list with 73 samples (columns) and now I want to extract top 10 for each column including the respective annotation (genus level) provided at the end. This data shows a few only but actually i have hundreds of rows. The data looks like this (subset). Can you please help how to do this?

ID-a    ID-b    ID-c    ID-d    ID-e    ID-f    ID-g    Genus
3709    5664    54  11  5   2   0   Methylocystis
1518    399 3   0   0   0   0   Kineococcus
15215   4016    46  3   2   0   0   Curvibacter
5   0   1   1   1   1   0   Sulfuritalea
13098   4699    47  5   3   13  10  Bradyrhizobium
5565    6451    2101    2774    1513    203 120 Dyella

The output would read like, separate arrangements of OTU annotation for each column (highest on top). And annotation may come next to each column. This can be done manually by sort function in excel but I would appreciate to know a quick smart way of doing this. Thanks

Excel sorting • 2.4k views

ADD COMMENT • link updated 5.8 years ago by shenwei356 8.4k • written 5.8 years ago by arsilan324 ▴ 90

1

Entering edit mode

While it may indeed be possible to do this automagically in excel it will require VB scripting or something similar. You will likely get pushback on this for trying to use excel to do bioinformatics. It is not a great idea to do so.

You may want to export the data from excel into a delimited (comma or tab) format. At that point you can use unix sort (by column) to extract the info you need with cut or awk.

ADD REPLY • link 5.8 years ago by GenoMax 141k

3

Entering edit mode

5.8 years ago

sacha ★ 2.4k

Use everything else but not excel . This software is to difficult to use.
For example with Python, this code will answer to your question. 4 lines only...

import pandas as pd  
df = pd.read_excel("test.xlsx","test")   
for i in df.columns:
     print(df[i].sort_values(ascending=False))

It returns :

Genus
Curvibacter       15215
Bradyrhizobium    13098
Dyella             5565
Methylocystis      3709
Kineococcus        1518
Sulfuritalea          5
Name: ID-a, dtype: int64

Genus
Dyella            6451
Methylocystis     5664
Bradyrhizobium    4699
Curvibacter       4016
Kineococcus        399
Sulfuritalea         0
Name: ID-b, dtype: int64

Genus
Dyella            2101
Methylocystis       54
Bradyrhizobium      47
Curvibacter         46
Kineococcus          3
Sulfuritalea         1
Name: ID-c, dtype: int64

Genus
Dyella            2774
Methylocystis       11
Bradyrhizobium       5
Curvibacter          3
Sulfuritalea         1
Kineococcus          0
Name: ID-d, dtype: int64

Genus
Dyella            1513
Methylocystis        5
Bradyrhizobium       3
Curvibacter          2
Sulfuritalea         1
Kineococcus          0
Name: ID-e, dtype: int64

Genus
Dyella            203
Bradyrhizobium     13
Methylocystis       2
Sulfuritalea        1
Curvibacter         0
Kineococcus         0
Name: ID-f, dtype: int64

Genus
Bradyrhizobium    10.0
Sulfuritalea       0.0
Curvibacter        0.0
Kineococcus        0.0
Methylocystis      0.0
Dyella             NaN
Name: ID-g, dtype: float64

ADD COMMENT • link 5.8 years ago by sacha ★ 2.4k

2

Entering edit mode

5.8 years ago

shenwei356 8.4k

Well, just for fun. Here's a long one-liner using csvtk and rush (can be replaced with GNU parallel). You can run this in Windows if you install the windows version !

csvtk xlsx2csv data.xlsx > data.csv

csvtk cut -f -Genus data.csv \
    | csvtk headers \
    | csvtk cut -t -f 2 \
    | rush -k "echo \"# {}\"; \
            csvtk cut -f {},Genus data.csv \
                | csvtk rename -f 1 -n Count \
                | csvtk sort -k Count:nr \
                | csvtk cut -f Genus,Count \
                | csvtk pretty \
                | csvtk head -n 11; \
            echo "

Output

# ID-a
Genus            Count
Curvibacter      15215
Bradyrhizobium   13098
Dyella           5565
Methylocystis    3709
Kineococcus      1518
Sulfuritalea     5

# ID-b
Genus            Count
Dyella           6451
Methylocystis    5664
Bradyrhizobium   4699
Curvibacter      4016
Kineococcus      399
Sulfuritalea     0

...

ADD COMMENT • link 5.8 years ago by shenwei356 8.4k

score 4 · Accepted Answer · 2018-07-04

4

Entering edit mode

5.8 years ago

Pierre Lindenbaum 161k

A one liner with 3 values with a tab-delimited file.

Loo over each colum, extract this column and the last column, save in a file, extract header, sort the remaining columns.

$ head -n 1 input.txt | tr "\t" "\n" | grep -v Genus | awk '{print NR;}' | while read C; do awk -F '\t' -v C=$C '{printf("%s%s\t%s\n",(NR==1?"#":""),$C,$NF);}' input.txt >  tmp.txt && head -n1 tmp.txt && tail -n +2 tmp.txt| LC_ALL=C sort -k1,1nr | head -n 3 && echo "########"  && rm tmp.txt ; done 

#ID-a   Genus
15215   Curvibacter
13098   Bradyrhizobium
5565    Dyella
########
#ID-b   Genus
6451    Dyella
5664    Methylocystis
4699    Bradyrhizobium
########
#ID-c   Genus
2101    Dyella
54  Methylocystis
47  Bradyrhizobium
########
#ID-d   Genus
2774    Dyella
11  Methylocystis
5   Bradyrhizobium
########
#ID-e   Genus
1513    Dyella
5   Methylocystis
3   Bradyrhizobium
########
#ID-f   Genus
203 Dyella
13  Bradyrhizobium
2   Methylocystis
########
#ID-g   Genus
120 Dyella
10  Bradyrhizobium
0   Curvibacter
########

ADD COMMENT • link 5.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

This command is not producing any file! Is it suppose to generate a file? I cannot see any error as well. I am using terminal.

head -n 1 seq.csv | tr "\t" "\n" | grep -v Genus | awk '{print NR;}' | while read C; do awk -F '\t' -v C=$C '{printf("%s%s\t%s\n",(NR==1?"#":""),$C,$NF);}' seq.csv >  tmp.csv && head -n1 tmp.txt && tail -n +2 tmp.csv| LC_ALL=C sort -k1,1nr | head -n 3 && echo "########"  && rm tmp.csv ; done

ADD REPLY • link 5.8 years ago by arsilan324 ▴ 90

0

Entering edit mode

It worked. I worked with txt file but i was using csv initially. Thanks

ADD REPLY • link 5.8 years ago by arsilan324 ▴ 90

0

Entering edit mode

One more question, can I select only top 10 using this command? Or it will sort all. It is taking longer. Running since half an hour and not finished yet...

ADD REPLY • link 5.8 years ago by arsilan324 ▴ 90

score 3 · Accepted Answer · 2018-07-04

3

Entering edit mode

5.8 years ago

cpad0112 21k

you can follow this link: https://www.extendoffice.com/documents/excel/4265-excel-sort-multiple-columns-independently.html?

or export the data from excel in tsv (tab separated values) or csv (comma separated ) and do it in R:

$ df1=read.csv("test.txt", sep = "\t", strip.white = T, stringsAsFactors = F)
$ sdf1=data.frame(apply(df1[,-ncol(df1)],2,sort),Genus=df1[,ncol(df1)])
$ write.table(sdf1,"sorted_data.txt",row.names = F, quote = F, sep="\t") ## To export as tsv
$ library(WriteXLS) ## For exporting to excel
$ WriteXLS(sdf1,"sdf1.xls", row.names = F)

edit: Removed cbind

ADD COMMENT • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

Not able to understand what the output is? It converts the last column "Genus" into numbers. But I need separately for each each column.

ADD REPLY • link 5.8 years ago by arsilan324 ▴ 90

0

Entering edit mode

> str(sdf1)
'data.frame':   6 obs. of  8 variables:
 $ ID.a : int  5 1518 3709 5565 13098 15215
 $ ID.b : int  0 399 4016 4699 5664 6451
 $ ID.c : int  1 3 46 47 54 2101
 $ ID.d : int  0 1 3 5 11 2774
 $ ID.e : int  0 1 2 3 5 1513
 $ ID.f : int  0 0 1 2 13 203
 $ ID.g : int  0 0 0 0 10 120
 $ Genus: Factor w/ 6 levels "Bradyrhizobium",..: 5 4 2 6 1 3
> sdf1
   ID.a ID.b ID.c ID.d ID.e ID.f ID.g          Genus
1     5    0    1    0    0    0    0  Methylocystis
2  1518  399    3    1    1    0    0    Kineococcus
3  3709 4016   46    3    2    1    0    Curvibacter
4  5565 4699   47    5    3    2    0   Sulfuritalea
5 13098 5664   54   11    5   13   10 Bradyrhizobium
6 15215 6451 2101 2774 1513  203  120         Dyella

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

What i get is this

> str(sdf1)
'data.frame':   2408 obs. of  74 variables:
 $ R3.Rhizo.1: int  0 0 0 0 0 0 0 0 0 0 ...
 $ R3.Rhizo.2: int  0 0 0 0 0 0 0 0 0 0 ...
 $ R3.Rhizo.3: int  0 0 0 0 0 0 0 0 0 0 ...
 $ R3.Rhizo.4: int  0 0 0 0 0 0 0 0 0 0 ...
$ Genus     : int  197 598 501 2 119 101 74 197 639 405 ...

All values are zero but genus has numbers. Whats wrong?!

ADD REPLY • link 5.8 years ago by arsilan324 ▴ 90

0

Entering edit mode

Check your df1 and delimiter. Above code is for tsv. See if your values are ts. if not use appropriate delimiter. check the str of df1

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

It worked! Thanks!

But now another question. It has arranged all the values in ascending order but the genus name is still at the end (common for all). I doubt if it is right. Each column should have its own unique Genus arrangement (if not at all, but at some levels). Do you understand my point?

Thank you!

ADD REPLY • link 5.8 years ago by arsilan324 ▴ 90

0

Entering edit mode

Got is. Here is the solution

$ df1=read.csv("test.txt", sep = "\t", strip.white = T, stringsAsFactors = F)
$ library(dplyr)
$ library(tidyr)
$ library(purrr)
$ df1%>% gather(.,k,v,-Genus) %>% split(.$k) %>% map(~select(.,c(1,3)) %>% arrange(desc(v)))

output:

$ID.a
           Genus     v
1    Curvibacter 15215
2 Bradyrhizobium 13098
3         Dyella  5565
4  Methylocystis  3709
5    Kineococcus  1518
6   Sulfuritalea     5

$ID.b
           Genus    v
1         Dyella 6451
2  Methylocystis 5664
3 Bradyrhizobium 4699
4    Curvibacter 4016
5    Kineococcus  399
6   Sulfuritalea    0

$ID.c
           Genus    v
1         Dyella 2101
2  Methylocystis   54
3 Bradyrhizobium   47
4    Curvibacter   46
5    Kineococcus    3
6   Sulfuritalea    1

$ID.d
           Genus    v
1         Dyella 2774
2  Methylocystis   11
3 Bradyrhizobium    5
4    Curvibacter    3
5   Sulfuritalea    1
6    Kineococcus    0

$ID.e
           Genus    v
1         Dyella 1513
2  Methylocystis    5
3 Bradyrhizobium    3
4    Curvibacter    2
5   Sulfuritalea    1
6    Kineococcus    0

$ID.f
           Genus   v
1         Dyella 203
2 Bradyrhizobium  13
3  Methylocystis   2
4   Sulfuritalea   1
5    Kineococcus   0
6    Curvibacter   0

$ID.g
           Genus   v
1         Dyella 120
2 Bradyrhizobium  10
3  Methylocystis   0
4    Kineococcus   0
5    Curvibacter   0
6   Sulfuritalea   0

simple one:

$ df1%>% gather(.,k,v,-Genus) %>% arrange(k,desc(v)) %>% select (k,Genus,v)

        k          Genus     v
1  ID.a    Curvibacter 15215
2  ID.a Bradyrhizobium 13098
3  ID.a         Dyella  5565
4  ID.a  Methylocystis  3709
5  ID.a    Kineococcus  1518
6  ID.a   Sulfuritalea     5
7  ID.b         Dyella  6451
8  ID.b  Methylocystis  5664
9  ID.b Bradyrhizobium  4699
10 ID.b    Curvibacter  4016
11 ID.b    Kineococcus   399
12 ID.b   Sulfuritalea     0
13 ID.c         Dyella  2101
14 ID.c  Methylocystis    54
15 ID.c Bradyrhizobium    47
16 ID.c    Curvibacter    46
17 ID.c    Kineococcus     3
18 ID.c   Sulfuritalea     1
19 ID.d         Dyella  2774
20 ID.d  Methylocystis    11
21 ID.d Bradyrhizobium     5
22 ID.d    Curvibacter     3
23 ID.d   Sulfuritalea     1
24 ID.d    Kineococcus     0
25 ID.e         Dyella  1513
26 ID.e  Methylocystis     5
27 ID.e Bradyrhizobium     3
28 ID.e    Curvibacter     2
29 ID.e   Sulfuritalea     1
30 ID.e    Kineococcus     0
31 ID.f         Dyella   203
32 ID.f Bradyrhizobium    13
33 ID.f  Methylocystis     2
34 ID.f   Sulfuritalea     1
35 ID.f    Kineococcus     0
36 ID.f    Curvibacter     0
37 ID.g         Dyella   120
38 ID.g Bradyrhizobium    10
39 ID.g  Methylocystis     0
40 ID.g    Kineococcus     0
41 ID.g    Curvibacter     0
42 ID.g   Sulfuritalea     0