Biostar Beta. Not for public use.
A possible script request for a task probably cannot be done in Excel
0
Entering edit mode
16 months ago
xiaoyonf • 10
Baylor College of Medicine, Houston, Te…

Dear all,

I have a dumb question that bothers me a lot. I have a set of genes with calculated ChIP-seq binding intensities. Notably, most of these genes have more than one value (i.e., because of multiple binding I thought). Now I want to clean these data by calculating the average of biding for each gene. The head of the data likes the following,

A2M -3.63
A2M -18.46
ACE 1.68
ACYP2   8.23
ACYP2   6.09
ADAMTS12    1.77
ADAMTS12    1.65
ADAMTS6 5.36
ADIPOQ  4.06
ADIPOQ  1.71
AGR2    13.71
AGR2    8.92
AGR2    5.05
AGR2    3.02
AGR2    2.23
AGR2    -4.22

I have these data in excel file. Obviously, it is annoying to do manually in Excel because of thousands of genes. How do I do this? I guess a simple script may help. But I have no experience in programming. Your help is much appreciated! Thank you! Xiaoyong Fu Houston, TX

excel • 928 views
ADD COMMENTlink
3
Entering edit mode

I hate excel but I think i can be easily done with "pivot table" isn't it ? https://support.office.com/en-us/article/Calculate-values-in-a-PivotTable-report-11f41417-da80-435c-a5c6-b0185e59da77

ADD REPLYlink
4
Entering edit mode

@Pierre recommending using excel for something. Well .. that is a first!

ADD REPLYlink
1
Entering edit mode

No I didn't ! No I didn't ! That wasn't me !

ADD REPLYlink
2
Entering edit mode

No worries everyone, I made a screenshot.

ADD REPLYlink
0
Entering edit mode

enter image description here

ADD REPLYlink
0
Entering edit mode

You are such a meme person!

ADD REPLYlink
0
Entering edit mode

Hi Pierre, Thank you so much! I think this is what I need to look for. I will check on it. BTW, is it possible to do so in R? Best, Xiaoyong

ADD REPLYlink
2
Entering edit mode

Everything can be done in Excel. Unless you go over 64k lines or whatever the latest Microsoft boss thinks is sufficient. But otherwise, everything.

ADD REPLYlink
7
Entering edit mode

I definitely love the convenience function of converting gene names to dates. Wouldn't know how to work without it!

ADD REPLYlink
0
Entering edit mode

The only thing you have to do when importing a file is to select the column containing the gene names and set data type to Text.

ADD REPLYlink
0
Entering edit mode

Yes, that I know. But it's an error which happens easily, silent and irreversible. That shouldn't be possible.

ADD REPLYlink
0
Entering edit mode

I agree, Excel is a powerful tool to work with. However, it takes some time to really grab its essence and logic. Since Excel 2007 tables can be "1,048,576 rows by 16,384 columns".

ADD REPLYlink
1
Entering edit mode

Hmm, I wasn't aware of that. But it seems that Excel is incapable of plotting more that 64k lines in a graph. It doesn't really matter to me how many you can put in a table, if you can't show them in a graph; data that can't be visualized may as well not exist.

"1,048,576 rows by 16,384 columns" is still completely arbitrary. And even if you can paste data in those fields, you can't use it in a graph because Microsoft either lacks skilled programmers or managers. There is no sane reason to restrict people to those limits. If I want to paste a billion lines into Excel, I should be able to do so (if I have sufficient RAM), and it baffles me why Microsoft is incapable of developing tools that can handle the needs of anyone outside of pedestrian users. They are stuck in the "640k is enough for anyone" mindset.

ADD REPLYlink
0
Entering edit mode

I love Excel for copying and pasting data into a sensible matrix so I can then plot it in something pretty (R - ggplot/plotly) ;) also if you're supervisor is among the less tech savvy, they inevitably want all experiments to produce and excel table they can play with

ADD REPLYlink
0
Entering edit mode

You say some of your genes have multiple values, but your example data doesn't. Can you give us a more representative example input file?

ADD REPLYlink
0
Entering edit mode

R dplyr: group_by -> summarise

ADD REPLYlink
3
Entering edit mode
10 months ago
geek_y 9.7k
Barcelona/CRG/London/Imperial

Assuming the tab delimited file:

sort -k1,1 in.txt | groupBy -g 1 -c 2 -o mean > out.txt

out.txt:

A2M -11.045
ACE 1.68
ACYP2   7.16
ADAMTS12    1.71
ADAMTS6 5.36
ADIPOQ  2.885
AGR2    4.785

groupBy from bedtools

ADD COMMENTlink
0
Entering edit mode

Hi Goutham, This looks so cool! I will try it definitely. Thank you so much! Xiaoyong

ADD REPLYlink
0
Entering edit mode

The bedtools groupby command works beautifully to perform such work. Thank you for quick answer! Xiaoyong

ADD REPLYlink
1
Entering edit mode
11 months ago
steve ♦ 2.0k
United States

You can also do this easily in R.

# reading in the data; insert your data import method here:
gene_stats <- "gene\tvalue
A2M\t-3.63
A2M\t-18.46
ACE\t1.68
ACYP2\t8.23
ACYP2\t6.09
ADAMTS12\t1.77
ADAMTS12\t1.65
ADAMTS6\t5.36
ADIPOQ\t4.06
ADIPOQ\t1.71
AGR2\t13.71
AGR2\t8.92
AGR2\t5.05
AGR2\t3.02
AGR2\t2.23
AGR2\t-4.22
"
gene_stats_df_df <- read.table(textConnection(gene_stats), header = TRUE, sep = '\t')

# get the mean for each gene in the table
aggregate(value ~ gene, data = gene_stats_df_df, FUN = mean)

output:

      gene   value
1      A2M -11.045
2      ACE   1.680
3    ACYP2   7.160
4 ADAMTS12   1.710
5  ADAMTS6   5.360
6   ADIPOQ   2.885
7     AGR2   4.785
ADD COMMENTlink
1
Entering edit mode
11 months ago
aquaq • 10

You can also copy the first column (A) to another column(D), then apply remove duplicates to this column. Then, in the E1 cell, you only have to use =AVERAGEIF(A:A, D1, B:B). With double click on the lower right corner of the E1 cell, you can copy this function for all the E column.

ADD COMMENTlink
0
Entering edit mode

Thank you, aquaq. This is I thought the quickest way to solve my problem!

ADD REPLYlink
1
Entering edit mode
16 months ago
arnstrm ♦ 1.7k
Ames, IA

In Excel, select the table, go to Inset, PivotTable, choose the location (Existing Worksheet and select a blank cell). A side bar section PivotTable Fiedls will appear. Drag the Gene field name to Rows and Value to the values. The default will be the sum of values for each gene. To change it click on the drop down menu and then value field settings and choose average

(suggesting only because Pierre mentioned this as the option :P)

screenshot

ADD COMMENTlink
0
Entering edit mode
22 months ago
theobroma22 ♦ 1.1k

X <- aggregate ( x, data, mean)

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1