How to match and merge two tables in a effective way where sample names are same but all rows are not same.
5
0
Entering edit mode
6.5 years ago
Mitra • 0

Hi All, I want to match and merge two tables in a effective way where sample names are same but all rows are not same. e.g.

less Test1.csv 

Datasets,RIFA3D33,RIFA3D44,RIFA3D5,RIFA3D23,RIFA3D40,RIFA3D26,RIFA3D12,RIFA3D20,RIFA3D15,RIFB3D15,RIFB3D5,RIFB3D40,RIFB3D26,RIFB3D23,RIFB3D44,RIFB3D33,RIFB3D20,RIFB3D12,RIFC3D12,RIFC3D33,RIFC3D23,RIFC3D15,RIFC3D20,RIFC3D40,RIFC3D44,RIFC3D26,RIFC3D5,RIFSAMPLEA,RIFSAMPLEB,RIFSAMPLEC,RIFSAMPLED
root,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
cellular organisms,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,3,0
Deferribacteres <phylum>,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteres,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacterales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Mucispirillum,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,1,1,0,2,0,0,0,1,0,0,0,0,0
FCB group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes/Chlorobi group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidales,9,3,2,3,4,2,8,7,0,5,12,0,3,0,1,5,5,1,2,1,7,1,4,2,3,1,4,306,28,105,41
Bacteroidaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroides,357,415,648,158,570,445,294,476,160,137,1148,28,181,192,313,472,637,597,65,439,215,113,7,374,491,233,79,343,943,486,790
Odoribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Butyricimonas,0,0,1,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,12,135,51
Odoribacter,0,1,7,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,2,1,3,7,37,78
Porphyromonadaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Parabacteroides,1350,734,515,9,1996,3941,291,0,103,672,668,24,1524,15,287,375,3,542,287,747,387,487,3,644,698,387,69,156,1887,419,74
Prevotellaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Prevotella,1,0,1,0,3,0,1,0,1,1,21,1,0,0,0,0,0,20,1,0,0,0,0,0,1,0,0,4,1,10,0
Rikenellaceae,4,2,2888,3,4,3,89,3,19,1066,5370,4,15,0,1,23,3,535,913,8,3797,1219,21,180,314,2103,567,182,1973,1525,1370
Alistipes,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4
Fusobacteria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Fusobacteriia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Fusobacteriales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Fusobacteriaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

And another table

less Test2.csv 

Datasets,RIFB3D20,RIFSAMPLEB,RIFB3D33,RIFSAMPLED,RIFA3D23,RIFSAMPLEC,RIFA3D12,RIFC3D44,RIFC3D40,RIFB3D5,RIFA3D26,RIFC3D26,RIFB3D12,RIFB3D26,RIFSAMPLEA,RIFA3D15,RIFB3D15,RIFC3D23,RIFA3D20,RIFC3D15,RIFB3D44,RIFA3D33,RIFA3D40,RIFC3D12,RIFC3D5,RIFC3D20,RIFC3D33,RIFB3D23,RIFA3D5,RIFA3D44,RIFB3D40
root,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
cellular organisms,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteria,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,1
Deferribacteres <phylum>,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteres,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacterales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Deferribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Mucispirillum,3,0,4,1,2,2,1,2,1,2,1,3,1,1,1,3,0,0,2,3,2,2,1,1,0,4,1,0,1,1,0
FCB group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes/Chlorobi group,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidetes,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidia,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroidales,8,113,30,33,9,83,10,17,10,27,10,10,5,17,877,9,28,17,14,21,12,26,23,32,14,16,14,5,29,11,14
Bacteroidaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Bacteroides,194,630,141,427,47,234,114,231,106,846,168,140,146,72,135,94,51,71,147,48,83,119,251,54,129,12,241,58,552,139,15
Odoribacteraceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Butyricimonas,0,9,0,13,0,64,0,0,0,2,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0
Odoribacter,0,6,0,36,2,24,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
Porphyromonadaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Parabacteroides,21,1294,861,106,13,348,1049,1221,956,1094,10648,193,1028,4501,278,463,1601,121,8,971,750,3959,7559,955,308,9,1255,48,1340,1791,113
Prevotellaceae,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Paraprevotella,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
Prevotella,2,0,1,5,2,5,2,2,5,8,2,2,7,4,4,2,4,2,0,0,0,2,3,3,1,0,6,0,6,2,1
Rikenellaceae,18,861,44,756,8,507,69,212,88,1426,18,1907,191,27,135,31,479,1606,14,471,24,16,18,555,415,30,20,7,1163,7,14
Alistipes,0,0,0,35,1,3,1,1,2,20,0,0,1,0,0,0,0,1,0,1,2,1,0,1,0,0,0,0,6,0,0
Proteobacteria,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alphaproteobacteria,2,72,0,1,2,96,2,2,0,10,2,12,0,2,1,2,1,5,0,2,1,0,0,3,1,1,3,2,6,1,0
Rhodobacterales,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Both files are .csv format. I want to merge such a way that I can get a union. i.e. to add abundance where I have same taxa in rows. Also to include the exact numbers where some taxa are present in one of the files.

Can anybody please help me? Thanks, Mitra

sequencing • 1.5k views
ADD COMMENT
2
Entering edit mode
6.5 years ago

test1 and test2 are copy/pasted from OP. R code below merges both the files (loaded from . csv) and row values are added (sum).

In R:

test1=read.csv("test1.csv", header = T, sep = ",", stringsAsFactors = F)
test2=read.csv("test2.csv", header = T, sep = ",", stringsAsFactors = F)
test1=(test1[,sort(colnames(test1))])
test2=(test2[,sort(colnames(test2))])
all(colnames(test1)==colnames(test2))
library(dplyr)
out.df=rbind(test1,test2) %>% group_by(Datasets) %>% summarise_all(sum)
View(out.df)

If you do not want to sort the columns (variables), then code would be:

test1=read.csv("test1.csv", header = T, sep = ",", stringsAsFactors = F)
test2=read.csv("test2.csv", header = T, sep = ",", stringsAsFactors = F)
library(dplyr)
out.df=rbind(test1,test2) %>% group_by(Datasets) %>% summarise_all(sum)
View(out.df)
ADD COMMENT
0
Entering edit mode

Thats perfectly what I need. THank you very much cpad0112. Many thanks. Mitra.

ADD REPLY
0
Entering edit mode
6.5 years ago

In R with the dplyr package you'd use full_join() for this.

ADD COMMENT
0
Entering edit mode
6.5 years ago
Mitra • 0

Sorry for my late reply. I tried to use the full_join function but (may be I tried wrong), its not adding the rows where we have entry in both data sets. keeping the numbers from first data and only merging the second data in case of new rows. Any help will be really great. Thanks, Mitra

ADD COMMENT
0
Entering edit mode

You'll need to provide a small (a couple lines) example of what you're inputting, what you're receiving, and what you're expecting. Alternatively, look at the dplyr cheat sheet, specifically page 2 under "combine datasets". That give you a visual depiction of what each command will do.

ADD REPLY
0
Entering edit mode
6.5 years ago
Mitra • 0
  > A
                     X.Datasets RIFB3D20 RIFSAMPLEB RIFB3D33 RIFSAMPLED RIFA3D23
10                    FCB group        0          0        0          0        0
11 Bacteroidetes/Chlorobi group        0          0        0          0        0
12                Bacteroidetes        0          0        0          0        0
13                  Bacteroidia        0          0        0          0        0
14                Bacteroidales        8        113       30         33        9
15               Bacteroidaceae        0          0        0          0        0
   RIFSAMPLEC RIFA3D12 RIFC3D44 RIFC3D40
10          0        0        0        0
11          0        0        0        0
12          0        0        0        0
13          0        0        0        0
14         83       10       17       10
15          0        0        0        0

> B
                     X.Datasets RIFA3D33 RIFA3D44 RIFA3D5 RIFA3D23 RIFA3D40 RIFA3D26
10 Bacteroidetes/Chlorobi group        0        0       0        0        0        0
11                Bacteroidetes        0        0       0        0        0        0
12                  Bacteroidia        0        0       0        0        0        0
13                Bacteroidales        9        3       2        3        4        2
14               Bacteroidaceae        0        0       0        0        0        0
15                  Bacteroides      357      415     648      158      570      445
   RIFA3D12 RIFA3D20 RIFA3D15
10        0        0        0
11        0        0        0
12        0        0        0
13        8        7        0
14        0        0        0
15      294      476      160


  > MerAB<-full_join(A,B,by=NULL)

Joining, by = c("X.Datasets", "RIFA3D23", "RIFA3D12")
Warning message:
In full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
  joining factors with different levels, coercing to character vector
> MerAB
                    X.Datasets RIFB3D20 RIFSAMPLEB RIFB3D33 RIFSAMPLED RIFA3D23
1                    FCB group        0          0        0          0        0
2 Bacteroidetes/Chlorobi group        0          0        0          0        0
3                Bacteroidetes        0          0        0          0        0
4                  Bacteroidia        0          0        0          0        0
5                Bacteroidales        8        113       30         33        9
6               Bacteroidaceae        0          0        0          0        0
7                Bacteroidales       NA         NA       NA         NA        3
8                  Bacteroides       NA         NA       NA         NA      158
  RIFSAMPLEC RIFA3D12 RIFC3D44 RIFC3D40 RIFA3D33 RIFA3D44 RIFA3D5 RIFA3D40 RIFA3D26
1          0        0        0        0       NA     <NA>      NA       NA       NA
2          0        0        0        0        0        0       0        0        0
3          0        0        0        0        0        0       0        0        0
4          0        0        0        0        0        0       0        0        0
5         83       10       17       10       NA     <NA>      NA       NA       NA
6          0        0        0        0        0        0       0        0        0
7         NA        8       NA       NA        9        3       2        4        2
8         NA      294       NA       NA      357      415     648      570      445
  RIFA3D20 RIFA3D15
1       NA       NA
2        0        0
3        0        0
4        0        0
5       NA       NA
6        0        0
7        7        0
8      476      160

For example here in MerAB "Bacteroidales" has taken the values for A only. I want all the same taxa which is present in both the datasets to be added and extra ones are joined from either A or B. Hope this example helps? Thanks, Mitra

ADD COMMENT
0
Entering edit mode
MerAB <- full_join(A, B, by='X.Datasets')

You can see in the Joining, by line that it printed out that it sees more than one common column name.

ADD REPLY
0
Entering edit mode

No this one only duplicate columns of all datasets by adding.x and .y

> dim(A)
[1]  6 32
> dim(B)
[1]  6 32
> dim(MerAB)
[1]  7 63

So not the intended result at all. Thanks anyway. M

ADD REPLY
0
Entering edit mode
6.5 years ago
Mitra • 0

My original two datasets dimensiona are 256 by 32 and 270 by 32.

All the top data name (32) are same for both. Only the rows are different. I wasnt these teow data combined (added in case of same rows.) Thanks. M

ADD COMMENT

Login before adding your answer.

Traffic: 2591 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6