Comparing two columns in a data frame across many rows using R language
1
0
Entering edit mode
8.2 years ago
fydhyun • 0

I have a data frame that I'm working within which I'd like to compare the contents inside two columns: PathwayName and ExpressionData. This comparison will be done across many rows (10,695,840 entries) using R language.

Here are the first few lines of my data frame where the contents inside are only separated by white space.

          PathwayName                                      ExpressionData 
1   41bbPathway BLACK   215538_at   210671_x_at...   215538_at  na  28.566616...
2   ace2Pathway BLACK   214533_at   215184_at...     215538_at  na  28.566616...
3   acetPathway BLACK   215184_at   01502_s_at...    215184_at  na  4.2084746...
4   achPathway  BLACK   211570_s_at 215184_at...     215184_at  na  4.2084746...
5   hoPathway   BLACK   201968_at   214578_s_at...   201968_at  na  472.4969...

As a final product, I want it to compare, copy and save into a new file where the output should be like this:

    PathwayName               ExpressionData 
1   41bbPathway 215538_at     215538_at
2   acetPathway 215184_at     215184_at
3   achPathway  215184_at     215184_at
4   hoPathway   201968_at     201968_at

Everything that I'd done were failed because most of them compare by rows and not the contents inside.

Hope there are people who can help.

Thank you

R • 4.1k views
ADD COMMENT
0
Entering edit mode

You will have to specify what comparison you'd like to perform. Also, is it that your data frame is

PathwayName                                                                    ExpressionData
41bbPathway    Black     215538_at     210671_x_at...     215538_at     na     28.566...
ace2Pathway    Black     214533_at     215184_at...       215538_at     na     28.566...

Also, it is unclear what you want to get as a final product. Could you please clarify?

ADD REPLY
0
Entering edit mode

My data frame is like this, with only 2 columns and many rows.

The contents inside is only separated by a whitespace.

PathwayName                                                                ExpressionData

41bbPathway BLACK 201502_s_at 210671_x_at 207539_s_at 207538_at            201502_s_at na 2180.5 1410.3849 1446.5948 1465.2518 
  205446_s_at 203837_at 214786_at 201465_s_at 201466_s_at 211087_x_at        1763.2852 1650.8464 990.2145 1664.4943 1331.3612 
  210354_at 209341_s_at 209342_s_at 210449_x_at 207849_at 211561_x_at        977.69604 1327.1173 1257.0425 1584.2706 1505.1035 
  209666_s_at 203836_s_at 201464_x_at 209239_at 211081_s_at 209878_s_at      892.7254 1727.603 1211.1047 1233.3248 1384.633 
  204413_at 211786_at 210477_x_at 201783_s_at 203552_at 207536_s_at          1189.2731 926.0287 1427.7644 1392.3889 1593.7533 
  212984_at 202530_at 203553_s_at 213281_at 206907_at 211027_s_at            1185.8496 1460.3658 1196.827 1257.5494 1216.057 
                                                                             1561.1877 1791.2827 1900.0209

ace2Pathway BLACK 214533_at 216893_s_at 211980_at 216896_at 208016_s_at    211966_at na 1.6578839 22.724152 15.242467 5.8232727 
  216368_s_at 210945_at 214602_at 211981_at 222257_s_at 214641_at            3.405121 4.8800807 21.504908 20.873146 18.703909 
  206367_at 219962_at 205357_s_at 211964_at 222321_at 202834_at              6.2823844 27.864975 18.29098 30.549957 1.534856 
  213992_at 211473_s_at 213110_s_at 207293_s_at 222073_at                    19.892557 27.367382 8.975096 25.638044 8.189315 
  216898_s_at 216367_at 207294_at 212966_at                                  16.064909 6.4715347 23.953142 31.584286 30.580189 
                                                                             28.84765 23.822178 16.611525 17.259954 11.141476 
                                                                             16.319036 14.997593 16.07276

The final product that I want is, a new column which contain the pathway name and the expression data that is similar with the one inside pathway name like below.

PathwayName                                                                ExpressionData

41bbPathway BLACK 201502_s_at 210671_x_at 207539_s_at 207538_at            201502_s_at na 2180.5 1410.3849 1446.5948 1465.2518 
  205446_s_at 203837_at 214786_at 201465_s_at 201466_s_at 211087_x_at        1763.2852 1650.8464 990.2145 1664.4943 1331.3612 
  210354_at 209341_s_at 209342_s_at 210449_x_at 207849_at 211561_x_at        977.69604 1327.1173 1257.0425 1584.2706 1505.1035 
  209666_s_at 203836_s_at 201464_x_at 209239_at 211081_s_at 209878_s_at      892.7254 1727.603 1211.1047 1233.3248 1384.633 
  204413_at 211786_at 210477_x_at 201783_s_at 203552_at 207536_s_at          1189.2731 926.0287 1427.7644 1392.3889 1593.7533 
  212984_at 202530_at 203553_s_at 213281_at 206907_at 211027_s_at            1185.8496 1460.3658 1196.827 1257.5494 1216.057 
                                                                             1561.1877 1791.2827 1900.0209

-                                                                             -
ADD REPLY
0
Entering edit mode

It seems that you have two data sets and try to combine them in the most non-intuitive way.

As Sam already stated, the Expression data -which you say is one entry- is a set of Probe-ID and value entries.

The first part needs to be disentangled. You need an entry for each pathway-name probe-ID combination. Having then two tables you can get an join:

Pathway table:

PathwayName    probeID
41bbPathway    215538_at
41bbPathway    210671_x_at

Expression table

probeID        expression
215538_at      na
210671_x_at    6.2823844
merge(pathway, expression, by="probeID")

Sometimes you get per probeID a lot of values, than you have to proceed according to the underlying experiment. In case of x test and y control samples, you can compute the log fold-change, or differential expression. If it is only one underlying condition, you can compute the mean or median. Nevertheless, try to understand the data.

[Found a typo in the merge command]

ADD REPLY
0
Entering edit mode
8.2 years ago
Sam ★ 4.7k

Your data format is extreme messy so I will assume that your pathway name will always be the third element in pathway name and the first element in expression.

Then you can use the following code:

dat = data.frame(PathwayName=c("41bbPathway BLACK 201502_s_at 210671_x_at 207539_s_at 207538_at 205446_s_at 203837_at 214786_at 201465_s_at 201466_s_at 211087_x_at 210354_at 209341_s_at 209342_s_at 210449_x_at 207849_at 211561_x_at 209666_s_at 203836_s_at 201464_x_at 209239_at 211081_s_at 209878_s_at 204413_at 211786_at 210477_x_at 201783_s_at 203552_at 207536_s_at 212984_at 202530_at 203553_s_at 213281_at 206907_at 211027_s_at","ace2Pathway BLACK 214533_at 216893_s_at 211980_at 216896_at 208016_s_at 216368_s_at 210945_at 214602_at 211981_at 222257_s_at 214641_at 206367_at 219962_at 205357_s_at 211964_at 222321_at 202834_at 213992_at 211473_s_at 213110_s_at 207293_s_at 222073_at 216898_s_at 216367_at 207294_at 212966_at"),ExpressionData=c("201502_s_at na 2180.5 1410.3849 1446.5948 1465.2518 1763.2852 1650.8464 990.2145 1664.4943 1331.3612 977.69604 1327.1173 1257.0425 1584.2706 1505.1035 892.7254 1727.603 1211.1047 1233.3248 1384.633 1189.2731 926.0287 1427.7644 1392.3889 1593.7533 1185.8496 1460.3658 1196.827 1257.5494 1216.057 1561.1877 1791.2827 1900.0209","211966_at na 1.6578839 22.724152 15.242467 5.8232727 3.405121 4.8800807 21.504908 20.873146 18.703909 6.2823844 27.864975 18.29098 30.549957 1.534856 19.892557 27.367382 8.975096 25.638044 8.189315 16.064909 6.4715347 23.953142 31.584286 30.580189 28.84765 23.822178 16.611525 17.259954 11.141476 16.319036 14.997593 16.07276"))

dat$Exp=sapply(strsplit(as.character(dat$ExpressionData), " "),"[[",1)
dat$Path=sapply(strsplit(as.character(dat$PathwayName), " "),"[[",3)
dat[dat$Exp==dat$Path,]

That should give you what you want.

(The first line of code is only use for recreating your data frame. The only code you'll need is the last three line of codes)

ADD COMMENT
0
Entering edit mode

Thank you but how if I want to increment the element in PathwayName so that it is not rigid to the third element but can compare until the last element until find the similar one as the first element in ExpressionData?

the number of element in PathwayName is not always same

ADD REPLY
0
Entering edit mode

Well, that will become way more complicated. If that is the case, it might actually be easier for you to reformat your data. Do look into michael.ante's reply.

What you want is still doable, but will be too time consuming and might take a long time to complete.

ADD REPLY

Login before adding your answer.

Traffic: 1669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6