Question

How to match and merge two tables in a effective way where sample names are same but all rows are not same.

0

Entering edit mode

6.5 years ago

Mitra • 0

Hi All, I want to match and merge two tables in a effective way where sample names are same but all rows are not same. e.g.

less Test1.csv 

Datasets,RIFA3D33,RIFA3D44,RIFA3D5,RIFA3D23,RIFA3D40,RIFA3D26,RIFA3D12,RIFA3D20,RIFA3D15,RIFB3D15,RIFB3D5,RIFB3D40,RIFB3D26,RIFB3D23,RIFB3D44,RIFB3D33,RIFB3D20,RIFB3D12,RIFC3D12,RIFC3D33,RIFC3D23,RIFC3D15,RIFC3D20,RIFC3D40,RIFC3D44,RIFC3D26,RIFC3D5,RIFSAMPLEA,RIFSAMPLEB,RIFSAMPLEC,RIFSAMPLED
root,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
cellular organisms,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3,0
Deferribacteres <phylum>,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteres,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacterales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Mucispirillum,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,1,1,0,2,0,0,0,1,0,0,0,0,0
FCB group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes/Chlorobi group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidales,9,3,2,3,4,2,8,7,0,5,12,0,3,0,1,5,5,1,2,1,7,1,4,2,3,1,4,306,28,105,41
Bacteroidaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroides,357,415,648,158,570,445,294,476,160,137,1148,28,181,192,313,472,637,597,65,439,215,113,7,374,491,233,79,343,943,486,790
Odoribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Butyricimonas,0,0,1,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,12,135,51
Odoribacter,0,1,7,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,1,3,7,37,78
Porphyromonadaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Parabacteroides,1350,734,515,9,1996,3941,291,0,103,672,668,24,1524,15,287,375,3,542,287,747,387,487,3,644,698,387,69,156,1887,419,74
Prevotellaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Prevotella,1,0,1,0,3,0,1,0,1,1,21,1,0,0,0,0,0,20,1,0,0,0,0,0,1,0,0,4,1,10,0
Rikenellaceae,4,2,2888,3,4,3,89,3,19,1066,5370,4,15,0,1,23,3,535,913,8,3797,1219,21,180,314,2103,567,182,1973,1525,1370
Alistipes,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
Fusobacteria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Fusobacteriia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Fusobacteriales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Fusobacteriaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

And another table

less Test2.csv 

Datasets,RIFB3D20,RIFSAMPLEB,RIFB3D33,RIFSAMPLED,RIFA3D23,RIFSAMPLEC,RIFA3D12,RIFC3D44,RIFC3D40,RIFB3D5,RIFA3D26,RIFC3D26,RIFB3D12,RIFB3D26,RIFSAMPLEA,RIFA3D15,RIFB3D15,RIFC3D23,RIFA3D20,RIFC3D15,RIFB3D44,RIFA3D33,RIFA3D40,RIFC3D12,RIFC3D5,RIFC3D20,RIFC3D33,RIFB3D23,RIFA3D5,RIFA3D44,RIFB3D40
root,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
cellular organisms,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteria,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1
Deferribacteres <phylum>,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteres,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacterales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Mucispirillum,3,0,4,1,2,2,1,2,1,2,1,3,1,1,1,3,0,0,2,3,2,2,1,1,0,4,1,0,1,1,0
FCB group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes/Chlorobi group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidales,8,113,30,33,9,83,10,17,10,27,10,10,5,17,877,9,28,17,14,21,12,26,23,32,14,16,14,5,29,11,14
Bacteroidaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroides,194,630,141,427,47,234,114,231,106,846,168,140,146,72,135,94,51,71,147,48,83,119,251,54,129,12,241,58,552,139,15
Odoribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Butyricimonas,0,9,0,13,0,64,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0
Odoribacter,0,6,0,36,2,24,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
Porphyromonadaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Parabacteroides,21,1294,861,106,13,348,1049,1221,956,1094,10648,193,1028,4501,278,463,1601,121,8,971,750,3959,7559,955,308,9,1255,48,1340,1791,113
Prevotellaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Paraprevotella,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
Prevotella,2,0,1,5,2,5,2,2,5,8,2,2,7,4,4,2,4,2,0,0,0,2,3,3,1,0,6,0,6,2,1
Rikenellaceae,18,861,44,756,8,507,69,212,88,1426,18,1907,191,27,135,31,479,1606,14,471,24,16,18,555,415,30,20,7,1163,7,14
Alistipes,0,0,0,35,1,3,1,1,2,20,0,0,1,0,0,0,0,1,0,1,2,1,0,1,0,0,0,0,6,0,0
Proteobacteria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alphaproteobacteria,2,72,0,1,2,96,2,2,0,10,2,12,0,2,1,2,1,5,0,2,1,0,0,3,1,1,3,2,6,1,0
Rhodobacterales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Both files are .csv format. I want to merge such a way that I can get a union. i.e. to add abundance where I have same taxa in rows. Also to include the exact numbers where some taxa are present in one of the files.

Can anybody please help me? Thanks, Mitra

sequencing • 1.5k views

ADD COMMENT • link updated 6.5 years ago by cpad0112 21k • written 6.5 years ago by Mitra • 0

0

Entering edit mode

6.5 years ago

Devon Ryan 104k

In R with the dplyr package you'd use full_join() for this.

ADD COMMENT • link 6.5 years ago by Devon Ryan 104k

0

Entering edit mode

6.5 years ago

Mitra • 0

Sorry for my late reply. I tried to use the full_join function but (may be I tried wrong), its not adding the rows where we have entry in both data sets. keeping the numbers from first data and only merging the second data in case of new rows. Any help will be really great. Thanks, Mitra

ADD COMMENT • link 6.5 years ago by Mitra • 0

0

Entering edit mode

You'll need to provide a small (a couple lines) example of what you're inputting, what you're receiving, and what you're expecting. Alternatively, look at the dplyr cheat sheet, specifically page 2 under "combine datasets". That give you a visual depiction of what each command will do.

ADD REPLY • link 6.5 years ago by Devon Ryan 104k

0

Entering edit mode

6.5 years ago

Mitra • 0

  > A
                     X.Datasets RIFB3D20 RIFSAMPLEB RIFB3D33 RIFSAMPLED RIFA3D23
10                    FCB group        0          0        0          0        0
11 Bacteroidetes/Chlorobi group        0          0        0          0        0
12                Bacteroidetes        0          0        0          0        0
13                  Bacteroidia        0          0        0          0        0
14                Bacteroidales        8        113       30         33        9
15               Bacteroidaceae        0          0        0          0        0
   RIFSAMPLEC RIFA3D12 RIFC3D44 RIFC3D40
10          0        0        0        0
11          0        0        0        0
12          0        0        0        0
13          0        0        0        0
14         83       10       17       10
15          0        0        0        0

> B
                     X.Datasets RIFA3D33 RIFA3D44 RIFA3D5 RIFA3D23 RIFA3D40 RIFA3D26
10 Bacteroidetes/Chlorobi group        0        0       0        0        0        0
11                Bacteroidetes        0        0       0        0        0        0
12                  Bacteroidia        0        0       0        0        0        0
13                Bacteroidales        9        3       2        3        4        2
14               Bacteroidaceae        0        0       0        0        0        0
15                  Bacteroides      357      415     648      158      570      445
   RIFA3D12 RIFA3D20 RIFA3D15
10        0        0        0
11        0        0        0
12        0        0        0
13        8        7        0
14        0        0        0
15      294      476      160


  > MerAB<-full_join(A,B,by=NULL)

Joining, by = c("X.Datasets", "RIFA3D23", "RIFA3D12")
Warning message:
In full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
  joining factors with different levels, coercing to character vector
> MerAB
                    X.Datasets RIFB3D20 RIFSAMPLEB RIFB3D33 RIFSAMPLED RIFA3D23
1                    FCB group        0          0        0          0        0
2 Bacteroidetes/Chlorobi group        0          0        0          0        0
3                Bacteroidetes        0          0        0          0        0
4                  Bacteroidia        0          0        0          0        0
5                Bacteroidales        8        113       30         33        9
6               Bacteroidaceae        0          0        0          0        0
7                Bacteroidales       NA         NA       NA         NA        3
8                  Bacteroides       NA         NA       NA         NA      158
  RIFSAMPLEC RIFA3D12 RIFC3D44 RIFC3D40 RIFA3D33 RIFA3D44 RIFA3D5 RIFA3D40 RIFA3D26
1          0        0        0        0       NA     <NA>      NA       NA       NA
2          0        0        0        0        0        0       0        0        0
3          0        0        0        0        0        0       0        0        0
4          0        0        0        0        0        0       0        0        0
5         83       10       17       10       NA     <NA>      NA       NA       NA
6          0        0        0        0        0        0       0        0        0
7         NA        8       NA       NA        9        3       2        4        2
8         NA      294       NA       NA      357      415     648      570      445
  RIFA3D20 RIFA3D15
1       NA       NA
2        0        0
3        0        0
4        0        0
5       NA       NA
6        0        0
7        7        0
8      476      160

For example here in MerAB "Bacteroidales" has taken the values for A only. I want all the same taxa which is present in both the datasets to be added and extra ones are joined from either A or B. Hope this example helps? Thanks, Mitra

ADD COMMENT • link 6.5 years ago by Mitra • 0

0

Entering edit mode

MerAB <- full_join(A, B, by='X.Datasets')

You can see in the Joining, by line that it printed out that it sees more than one common column name.

ADD REPLY • link 6.5 years ago by Devon Ryan 104k

0

Entering edit mode

No this one only duplicate columns of all datasets by adding.x and .y

> dim(A)
[1]  6 32
> dim(B)
[1]  6 32
> dim(MerAB)
[1]  7 63

So not the intended result at all. Thanks anyway. M

ADD REPLY • link 6.5 years ago by Mitra • 0

0

Entering edit mode

6.5 years ago

Mitra • 0

My original two datasets dimensiona are 256 by 32 and 270 by 32.

All the top data name (32) are same for both. Only the rows are different. I wasnt these teow data combined (added in case of same rows.) Thanks. M

ADD COMMENT • link 6.5 years ago by Mitra • 0

score 2 · Accepted Answer · 2017-10-16

test1 and test2 are copy/pasted from OP. R code below merges both the files (loaded from . csv) and row values are added (sum).

In R:

test1=read.csv("test1.csv", header = T, sep = ",", stringsAsFactors = F)
test2=read.csv("test2.csv", header = T, sep = ",", stringsAsFactors = F)
test1=(test1[,sort(colnames(test1))])
test2=(test2[,sort(colnames(test2))])
all(colnames(test1)==colnames(test2))
library(dplyr)
out.df=rbind(test1,test2) %>% group_by(Datasets) %>% summarise_all(sum)
View(out.df)

If you do not want to sort the columns (variables), then code would be:

test1=read.csv("test1.csv", header = T, sep = ",", stringsAsFactors = F)
test2=read.csv("test2.csv", header = T, sep = ",", stringsAsFactors = F)
library(dplyr)
out.df=rbind(test1,test2) %>% group_by(Datasets) %>% summarise_all(sum)
View(out.df)