Collapse Probes For Same Gene
12.2 years ago
Rituriya ▴ 40

Dear All,

If a gene has more than one row of expression values, I want to collapse it into a single row that holds the highest value (maximum) in each column for that gene.

The file has 15 columns of expression values, one per tissue. It's a text file containing Affymetrix probes with the Hgu133plus2 annotation. I have tried GATEexplorer, ADAPT, BrainCDF, etc., but none of them was useful to me. I also tried GenePattern, but my file is too large for it to accept. Can anyone suggest a solution?

gene • 12k views
12.2 years ago
Neilfws 49k

This is quite easy in R.

First, you'll need an extra column with the gene names. Assuming your file is tab-delimited with column headers, read it into R:

mydat <- read.table("myfile.txt", header = TRUE, sep = "\t")

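Assuming the gene names are in column 16 under the header "gene", that column will be of class factor in R versions before 4.0.0. From 4.0.0 on, read.table returns character columns unless you pass stringsAsFactors = TRUE; aggregate works with either class.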

class(mydat$gene)
[1] "factor"
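
If the file holds only probe IDs and no gene column yet, one way to add it is to join against a probe-to-gene table. The sketch below assumes a hypothetical column named probe and made-up IDs and values; the mapping itself would have to come from the array annotation.

```r
# Hypothetical expression table keyed by probe ID (two tissues for brevity).
expr <- data.frame(
  probe   = c("p1", "p2", "p3"),
  tissue1 = c(5.1, 7.3, 6.0),
  tissue2 = c(4.8, 7.9, 6.2),
  stringsAsFactors = FALSE
)

# Hypothetical probe-to-gene mapping; real mappings come from the annotation.
map <- data.frame(
  probe = c("p1", "p2", "p3"),
  gene  = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Inner join on the shared "probe" column; unmapped probes are dropped.
mydat <- merge(expr, map, by = "probe")
```

Probes that map to no gene disappear in this join, which may or may not be what you want.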

You can now calculate the maximum by gene name using aggregate:

mydat.max <- aggregate(. ~ gene, data = mydat, max)

The new variable mydat.max is a data frame with gene names in the first column and one row per gene holding the maximum of each column.
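
The formula . ~ gene aggregates every column other than gene, so a probe ID column would be aggregated too. A minimal sketch of one way around that, assuming a hypothetical probe column and made-up values, is to pass only the numeric columns to aggregate:

```r
# Hypothetical data: a probe ID column, two tissue columns, and gene names.
mydat <- data.frame(
  probe   = c("p1", "p2", "p3"),
  tissue1 = c(5.1, 7.3, 6.0),
  tissue2 = c(8.8, 7.9, 6.2),
  gene    = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# Logical vector marking the numeric (expression) columns.
is_expr <- sapply(mydat, is.numeric)

# Data-frame interface of aggregate: values first, grouping passed via `by`.
mydat.max <- aggregate(mydat[, is_expr, drop = FALSE],
                       by = list(gene = mydat$gene),
                       FUN = max)
```

Each maximum is taken per column, so the row for gene A here combines tissue1 from one probe with tissue2 from another.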

As a toy example, if the data frame mydat looks like this:

a    b  gene
1   11     A
2   12     A
3   13     A
4   14     A
5   15     A
6   16     B
7   17     B
8   18     B  
9   19     B
10  20     B

After aggregate it becomes:

gene    a  b
   A    5 15
   B   10 20

Thank you so much neilfws! This is exactly what I wanted. Thank you once again.

12.2 years ago
ALchEmiXt ★ 1.9k

Why would you want to do that in the first place?

Duplicate probes for a gene on an array are useful as controls, but they usually also let you detect differentially expressed variants (including possible splice variants). So simply combining them into a single gene value loses analysis resolution and is probably dangerous as well.


It's quite common to collapse probes down to the gene level in some applications, and often a very simple metric such as the median is used. You can argue that information is lost, but so is noise.
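
For the median, the same aggregate call works with a different summary function. A small sketch on made-up values:

```r
# Made-up data: three rows for gene A, two for gene B.
mydat <- data.frame(
  a    = c(1, 2, 9, 4, 6),
  b    = c(10, 30, 20, 5, 7),
  gene = c("A", "A", "A", "B", "B")
)

# Median of every other column within each gene.
mydat.med <- aggregate(. ~ gene, data = mydat, FUN = median)
```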


@neilfws I agree on the data reduction and possibly the noise. But taking the highest value?


Perhaps the OP meant to select the row with the lowest P-value (least likely to occur by chance)? An approach is suggested here.

12.2 years ago

If you have the .CEL file for your array and you wish to summarize it from the probe level to the gene level, you might try Aroma, Expression Console, Affy Power Tools, RMAExpress, and many others.

Or perhaps you already have a processed file of gene expression values in which the same gene still has multiple rows. In that case, many people script that kind of file manipulation in R, Perl, Awk, Python, etc. If that is what you mean, try posting a slice of your file and someone may provide examples.
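
As a first scripting step in R, you can check which genes have more than one row. This sketch uses made-up data and assumes the gene names sit in a column called gene:

```r
# Made-up data: genes A and B appear twice, gene C once.
mydat <- data.frame(
  a    = 1:5,
  gene = c("A", "B", "A", "C", "B")
)

# Count rows per gene, then keep the names of genes seen more than once.
n_rows <- table(mydat$gene)
multi  <- names(n_rows)[n_rows > 1]
```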


Perhaps a small snippet/example using any of the packages in your first paragraph would be more informative.
