Biostar Beta. Not for public use.
Comparing two columns in a data frame across many rows using R language
0
Entering edit mode
4.6 years ago
fydhyun • 0

I have a data frame that I'm working within which I'd like to compare the contents inside two columns; PathwayName and ExpressionData. This comparison will be done across many rows (10,695,840 entries) using R language.

Here are the first few lines of my data frame where the contents inside are only separated by whitespace.

          PathwayName                                      ExpressionData 
1   41bbPathway BLACK   215538_at   210671_x_at...   215538_at  na  28.566616...
2   ace2Pathway BLACK   214533_at   215184_at...     215538_at  na  28.566616...    
3   acetPathway BLACK   215184_at   01502_s_at...    215184_at  na  4.2084746...
4   achPathway  BLACK   211570_s_at 215184_at...     215184_at  na  4.2084746...
5   hoPathway   BLACK   201968_at   214578_s_at...   201968_at  na  472.4969...

As a final product, I want it to compare, copy and save into a new file where the output should be like this:

    PathwayName               ExpressionData 
1   41bbPathway 215538_at     215538_at         
2   acetPathway 215184_at     215184_at 
3   achPathway  215184_at     215184_at 
4   hoPathway   201968_at     201968_at  

Everything that I'd done were failed because most of them compare by rows and not the contents inside.

hope there are people who can help.

thank you

R • 1.4k views
ADD COMMENTlink
0
Entering edit mode

You will have to specify what comparison you'd like to perform. Also, is it that your data frame is

PathwayName ExpressionData
41bbPathway

| Black | 215538_at | 210671_x_at... | 215538_at | na | 28.566...

ace2Pathway

| Black | 214533_at | 215184_at... | 215538_at | na | 28.566...

Also, it is unclear what you want to get as a final product. Could you please clarify?

ADD REPLYlink
0
Entering edit mode

my data frame is like this, with only 2 columns and many rows.

the contents inside is only separated by a whitespace.

PathwayName ExpressionData

41bbPathway BLACK 201502_s_at 210671_x_at 207539_s_at 207538_at 205446_s_at 203837_at 214786_at 201465_s_at 201466_s_at 211087_x_at 210354_at 209341_s_at 209342_s_at 210449_x_at 207849_at 211561_x_at 209666_s_at 203836_s_at 201464_x_at 209239_at 211081_s_at 209878_s_at 204413_at 211786_at 210477_x_at 201783_s_at 203552_at 207536_s_at 212984_at 202530_at 203553_s_at 213281_at 206907_at 211027_s_at

|

201502_s_at na 2180.5 1410.3849 1446.5948 1465.2518 1763.2852 1650.8464 990.2145 1664.4943 1331.3612 977.69604 1327.1173 1257.0425 1584.2706 1505.1035 892.7254 1727.603 1211.1047 1233.3248 1384.633 1189.2731 926.0287 1427.7644 1392.3889 1593.7533 1185.8496 1460.3658 1196.827 1257.5494 1216.057 1561.1877 1791.2827 1900.0209

ace2Pathway BLACK 214533_at 216893_s_at 211980_at 216896_at 208016_s_at 216368_s_at 210945_at 214602_at 211981_at 222257_s_at 214641_at 206367_at 219962_at 205357_s_at 211964_at 222321_at 202834_at 213992_at 211473_s_at 213110_s_at 207293_s_at 222073_at 216898_s_at 216367_at 207294_at 212966_at

| |

211966_at na 1.6578839 22.724152 15.242467 5.8232727 3.405121 4.8800807 21.504908 20.873146 18.703909 6.2823844 27.864975 18.29098 30.549957 1.534856 19.892557 27.367382 8.975096 25.638044 8.189315 16.064909 6.4715347 23.953142 31.584286 30.580189 28.84765 23.822178 16.611525 17.259954 11.141476 16.319036 14.997593 16.07276


the final product that I want is, a new column which contain the pathway name and the expression data that is similar with the one inside pathway name like below.

PathwayName ExpressionData
41bbPathway BLACK 201502_s_at 210671_x_at 207539_s_at 207538_at 205446_s_at 203837_at 214786_at 201465_s_at 201466_s_at 211087_x_at 210354_at 209341_s_at 209342_s_at 210449_x_at 207849_at 211561_x_at 209666_s_at 203836_s_at 201464_x_at 209239_at 211081_s_at 209878_s_at 204413_at 211786_at 210477_x_at 201783_s_at 203552_at 207536_s_at 212984_at 202530_at 203553_s_at 213281_at 206907_at 211027_s_at 201502_s_at na 2180.5 1410.3849 1446.5948 1465.2518 1763.2852 1650.8464 990.2145 1664.4943 1331.3612 977.69604 1327.1173 1257.0425 1584.2706 1505.1035 892.7254 1727.603 1211.1047 1233.3248 1384.633 1189.2731 926.0287 1427.7644 1392.3889 1593.7533 1185.8496 1460.3658 1196.827 1257.5494 1216.057 1561.1877 1791.2827 1900.0209
- -
ADD REPLYlink
0
Entering edit mode

It seems that you have two data sets and try to combine them in the most non-intuitive way.
As Sam already stated, the Expression data -which you say is one entry- is a set of Probe-ID and value entries.
The first part needs to be disentangled. You need an entry for each pathway-name probe-ID combination.
Having then two tables you can get an join:

pathway table:

PathwayName

| probeID
---|---
41bbPathway | 215538_at
41bbPathway | 210671_x_at

expression table

probeID expression
215538_at na
210671_x_at 6.2823844

merge(pathway, expression, by="probe ID ")

Sometimes you get per probeID a lot of values, than you have to proceed according to the underlying experiment. In case of x test and y control samples, you can compute the log fold-change, or differential expression. If it is only one underlying condition, you can compute the mean or median.
Nevertheless, try to understand the data.

[Found a typo in the merge command]

ADD REPLYlink
0
Entering edit mode
16 months ago
Sam ♦ 2.3k
New York

Your data format is extreme messy so I will assume that your pathway name will always be the third element in pathway name and the first element in expression.

Then you can use the following code:

dat = data.frame(PathwayName=c("41bbPathway BLACK 201502_s_at 210671_x_at 207539_s_at 207538_at 205446_s_at 203837_at 214786_at 201465_s_at 201466_s_at 211087_x_at 210354_at 209341_s_at 209342_s_at 210449_x_at 207849_at 211561_x_at 209666_s_at 203836_s_at 201464_x_at 209239_at 211081_s_at 209878_s_at 204413_at 211786_at 210477_x_at 201783_s_at 203552_at 207536_s_at 212984_at 202530_at 203553_s_at 213281_at 206907_at 211027_s_at","ace2Pathway BLACK 214533_at 216893_s_at 211980_at 216896_at 208016_s_at 216368_s_at 210945_at 214602_at 211981_at 222257_s_at 214641_at 206367_at 219962_at 205357_s_at 211964_at 222321_at 202834_at 213992_at 211473_s_at 213110_s_at 207293_s_at 222073_at 216898_s_at 216367_at 207294_at 212966_at"),ExpressionData=c("201502_s_at na 2180.5 1410.3849 1446.5948 1465.2518 1763.2852 1650.8464 990.2145 1664.4943 1331.3612 977.69604 1327.1173 1257.0425 1584.2706 1505.1035 892.7254 1727.603 1211.1047 1233.3248 1384.633 1189.2731 926.0287 1427.7644 1392.3889 1593.7533 1185.8496 1460.3658 1196.827 1257.5494 1216.057 1561.1877 1791.2827 1900.0209","211966_at na 1.6578839 22.724152 15.242467 5.8232727 3.405121 4.8800807 21.504908 20.873146 18.703909 6.2823844 27.864975 18.29098 30.549957 1.534856 19.892557 27.367382 8.975096 25.638044 8.189315 16.064909 6.4715347 23.953142 31.584286 30.580189 28.84765 23.822178 16.611525 17.259954 11.141476 16.319036 14.997593 16.07276"))

dat$Exp=sapply(strsplit(as.character(dat$ExpressionData), " "),"[[", **1** )
dat$Path=sapply(strsplit(as.character(dat$PathwayName), " "),"[[", **3** )
dat[dat$Exp==dat$Path,]

That should give you what you want.

(The first line of code is only use for recreating your data frame. The only code you'll need is the last three line of codes)

ADD COMMENTlink
0
Entering edit mode

thank you but __ how if I want to increment the element in PathwayName so that it is not rigid to the third element but can compare until the last element* until find the similar one as the first element in ExpressionData?

*the number of element in PathwayName is not always same

ADD REPLYlink
0
Entering edit mode

Well, that will become way more complicated. If that is the case, it might actually be easier for you to reformat your data. Do look into michael.ante's reply.

What you want is still doable, but will be too time consuming and might take a long time to complete.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1