Question

Adding unique index or names to merged data frames

5.4 years ago

dodausp ▴ 180

Hi everyone,

I am pretty sure most of you can come up with a quick solution to this, so sorry for the simple question. Here is the issue: I have multiple files of SNV counts, each corresponding to one sample. I want to merge them all into one data frame but still keep a unique ID for each file, other than the file names themselves (those are too long).

The example:

> file_list <- list.files()
> file_list

[1] "TCGA-A1-A0TT-01.hg19.oncotator.hugo_entrez_remapped.maf.txt" 
[2] "TCGA-A1-A0TG-01.hg19.oncotator.hugo_entrez_remapped.maf.txt"
[3] "TCGA-A1-A0SH-01.hg19.oncotator.hugo_entrez_remapped.maf.txt" 
[4] "TCGA-A1-A0BF-01.hg19.oncotator.hugo_entrez_remapped.maf.txt"
[5] "TCGA-A1-A0XB-01.hg19.oncotator.hugo_entrez_remapped.maf.txt"
[6] "TCGA-A1-A0YI-01.hg19.oncotator.hugo_entrez_remapped.maf.txt"

I am using this line to merge them all:

> datamerge <- do.call("rbind", lapply(file_list, FUN=function(files) {
read.table(files, header=TRUE, sep="\t", fill=TRUE)
}))

I get this:

> datamerge

   Chrom     start       end ref alt   Symbol                   Patient_id     variant_class
1     10 116247760 116247760   T   C   ABLIM1 TCGA-A1-A0TT-01A-11D-A142-09 Missense_Mutation
2     12  43944926  43944926   T   C ADAMTS20 TCGA-A1-A0TT-01A-11D-A142-09 Missense_Mutation
3      3  85932472  85932472   C   T    CADM2 TCGA-A1-A0TG-01A-11D-A142-09            Silent
4      2  25678299  25678299   C   T     DTNB TCGA-A1-A0TG-01A-11D-A142-09 Missense_Mutation
5     17  40272381  40272381   G   A    KAT2A TCGA-A1-A0TG-01A-11D-A142-09            Silent
6      5  80024722  80024722   T   -     MSH3 TCGA-A1-A0TG-01A-11D-A142-09   Frame_Shift_Del
7      6 135507043 135507044   -   A      MYB TCGA-A1-A0SH-01A-11D-A142-09   Frame_Shift_Ins
8     16  74425902  74425902   T   C  NPIPB15 TCGA-A1-A0BF-01A-11D-A142-09 Missense_Mutation
9     22  16449539  16449539   A   G   OR11H1 TCGA-A1-A0BF-01A-11D-A142-09 Missense_Mutation
10    20  16730581  16730581   G   A     OTOR TCGA-A1-A0BF-01A-11D-A142-09 Missense_Mutation
11     X  78216689  78216689   C   T   P2RY10 TCGA-A1-A0XB-01A-11D-A142-09            Silent
12    16  88790292  88790292   T   C   PIEZO1 TCGA-A1-A0XB-01A-11D-A142-09 Missense_Mutation
13     1  44476442  44476442   C   T   SLC6A9 TCGA-A1-A0XB-01A-11D-A142-09 Missense_Mutation
14    17   7491739   7491739   T   G    SOX15 TCGA-A1-A0YI-01A-11D-A142-09 Missense_Mutation

Note that each file also has its own "Patient_id" (the same prefix as the file name plus additional characters) in the merged table.

As I said at the beginning, I would like to keep "Patient_id" unique to each merged file but replace its values with shorter names. Say all my samples in this example are lung cancer (LUAD); then I would like something like this:

   Chrom     start       end ref alt   Symbol   Patient_id     variant_class
1     10 116247760 116247760   T   C   ABLIM1       LUNG_1 Missense_Mutation
2     12  43944926  43944926   T   C ADAMTS20       LUNG_1 Missense_Mutation
3      3  85932472  85932472   C   T    CADM2       LUNG_2            Silent
4      2  25678299  25678299   C   T     DTNB       LUNG_2 Missense_Mutation
5     17  40272381  40272381   G   A    KAT2A       LUNG_2            Silent
6      5  80024722  80024722   T   -     MSH3       LUNG_2   Frame_Shift_Del
7      6 135507043 135507044   -   A      MYB       LUNG_3   Frame_Shift_Ins
8     16  74425902  74425902   T   C  NPIPB15       LUNG_4 Missense_Mutation
9     22  16449539  16449539   A   G   OR11H1       LUNG_4 Missense_Mutation
10    20  16730581  16730581   G   A     OTOR       LUNG_4 Missense_Mutation
11     X  78216689  78216689   C   T   P2RY10       LUNG_5            Silent
12    16  88790292  88790292   T   C   PIEZO1       LUNG_5 Missense_Mutation
13     1  44476442  44476442   C   T   SLC6A9       LUNG_5 Missense_Mutation
14    17   7491739   7491739   T   G    SOX15       LUNG_6 Missense_Mutation

Would you have any workaround for this? Either adding the proper name while merging the files or after it. It doesn't really matter to me.

Any help is very much appreciated! (:

SNP merge index TCGA sequencing • 1.0k views

updated 5.4 years ago by benformatics 3.9k • written 5.4 years ago by dodausp ▴ 180

score 3 · Accepted Answer · 2018-11-28

You can do this with the plyr package.

Here is the basic idea, from cookbook-r.com:

# A sample factor to work with.
x <- factor(c("alpha","beta","gamma","alpha","beta"))
x
#> [1] alpha beta  gamma alpha beta 
#> Levels: alpha beta gamma

levels(x)
#> [1] "alpha" "beta"  "gamma"

library(plyr)
revalue(x, c("beta"="two", "gamma"="three"))
#> [1] alpha two   three alpha two  
#> Levels: alpha two three

mapvalues(x, from = c("beta", "gamma"), to = c("two", "three"))
#> [1] alpha two   three alpha two  
#> Levels: alpha two three

In your case, x is the Patient_id column of your data frame, i.e. datamerge$Patient_id.

You may have to convert it to a factor first (e.g. datamerge$Patient_id <- factor(datamerge$Patient_id)), but after that mapvalues or revalue should work.
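If you would rather not depend on plyr, the same relabelling can probably be done in base R with match(). This is only a sketch on a toy vector: patient_id, sample_number and new_id are names made up for illustration, and the barcodes are copied from the question.

```r
# Toy stand-in for datamerge$Patient_id (barcodes copied from the question).
patient_id <- c("TCGA-A1-A0TT-01A-11D-A142-09",
                "TCGA-A1-A0TT-01A-11D-A142-09",
                "TCGA-A1-A0TG-01A-11D-A142-09",
                "TCGA-A1-A0SH-01A-11D-A142-09",
                "TCGA-A1-A0TG-01A-11D-A142-09")

# match() gives, for each element, the position of its ID in the vector of
# unique IDs, i.e. a sample number in order of first appearance.
sample_number <- match(patient_id, unique(patient_id))

# Build the short labels from that number.
new_id <- paste0("LUNG_", sample_number)
new_id
#> [1] "LUNG_1" "LUNG_1" "LUNG_2" "LUNG_3" "LUNG_2"
```

On the real data this would be datamerge$Patient_id <- paste0("LUNG_", match(datamerge$Patient_id, unique(datamerge$Patient_id))), assuming the rows are still in the order the files were read.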

With plyr loaded, this should work for your exact example:

datamerge$Patient_id <- factor(datamerge$Patient_id)
# Original IDs, one per file, in order of first appearance
ids <- unique(as.character(datamerge$Patient_id))
datamerge$Patient_id <- mapvalues(datamerge$Patient_id, from = ids, to = paste0("LUNG_", seq_along(ids)))
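Since you said it does not matter whether the name is added during or after the merge, here is a rough sketch of the "during" variant: loop over the file indices instead of the file names and set Patient_id as each file is read. The demo files, read_with_label and id_key are hypothetical names used only to make the example self-contained; with real data you would keep your own file_list.

```r
# Hypothetical demo files standing in for the real MAF files.
demo_dir <- tempfile("maf_demo_")
dir.create(demo_dir)
write.table(data.frame(Chrom = c("10", "12"), Symbol = c("ABLIM1", "ADAMTS20")),
            file.path(demo_dir, "sample_a.maf.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(Chrom = "3", Symbol = "CADM2"),
            file.path(demo_dir, "sample_b.maf.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

file_list <- list.files(demo_dir, full.names = TRUE)

# Read file number i and set Patient_id to a short label built from i.
read_with_label <- function(i) {
  df <- read.table(file_list[i], header = TRUE, sep = "\t", fill = TRUE,
                   stringsAsFactors = FALSE)
  # rep() keeps this safe for files that contain a header but no rows.
  df$Patient_id <- rep(paste0("LUNG_", i), nrow(df))
  df
}

datamerge <- do.call("rbind", lapply(seq_along(file_list), read_with_label))

# Lookup table so each short label can be traced back to its file.
id_key <- data.frame(Patient_id = paste0("LUNG_", seq_along(file_list)),
                     file = basename(file_list),
                     stringsAsFactors = FALSE)
```

Here the numbering follows the order of file_list, so keeping something like id_key around is a cheap way to map LUNG_n back to the original TCGA barcode later.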