Question

R programming question: genotyping data table manipulation2

9.0 years ago

MAPK ★ 2.1k

Hi Guys,

I have two data frames, df1 and df2: df1 holds the alleles and df2 the genotypes. There are more than 50 samples (1:50), and df2 has a genotype (.GT) and a depth-of-coverage (.AD) column for each, interleaved one after the other: Geno1.GT, Geno1.AD, Geno2.GT, Geno2.AD, ... Geno50.GT, Geno50.AD. How do I match columns on Geno plus the sample number only (i.e., ignoring the .AD or .GT extension), so that, e.g., Geno1, Geno1.GT and Geno1.AD end up together as in the result table below? Thank you.

df1

Geno1     Geno2     Geno3
A         A         A
C         G         C
C         A         G

df2

Geno1.GT     Geno1.AD     Geno2.GT     Geno2.AD     Geno3.GT     Geno3.AD
0/0          22,3         0/0          33,2         0/0          33,3
0/0          2,0          0/1          22,3         1/1          43,33
0/1          55,45        0/0          32,2         1/1          22,3

Result

Geno1     Geno1.GT     Geno1.AD     Geno2     Geno2.GT     Geno2.AD     Geno3     Geno3.GT     Geno3.AD
A         0/0          22,3         A         0/0          33,2         A         0/0          33,3
C         0/0          2,0          G         0/1          22,3         C         1/1          43,33
C         0/1          55,45        A         0/0          32,2         G         1/1          22,3

Updated 21 months ago by Ram 43k • written 9.0 years ago by MAPK ★ 2.1k

Are the rows of df1 and df2 matching and in the same order?

Reply • 9.0 years ago by Sean Davis 26k

Thank you, Sean, for your reply. df1 has more columns than df2 (Geno1:Geno100 or more). All the samples in df2 are present in df1, but not the other way around: df1 has more samples than df2, and the order is also different.

Reply • 9.0 years ago by MAPK ★ 2.1k

Sean asked about the rows. If the rows are in the same order, merging is going to be much easier.

Reply • updated 21 months ago by Ram 43k • written 9.0 years ago by karl.stamm 4.1k

Sorry, yes, the rows are in the same order and equal in number.

Reply • updated 21 months ago by Ram 43k • written 9.0 years ago by MAPK ★ 2.1k

Ram · Answer 1 · 2015-04-30

Assuming things are in order and you don't have to do any actual pattern matching, the answer could be as simple as:

# bind them together to form one dataframe
result.df <- cbind(df1,df2)
# rearrange the columns using a pattern of numbers
result.df <- result.df[,as.vector(t(cbind(1:50,seq(51,150,2),seq(52,150,2))))]

Because the columns of the two data frames follow a regular pattern, you can generate the column positions as number sequences and interleave them to get the order you want. (This also assumes your first data frame has 50 columns and the second has twice as many; the code can easily be modified to use the actual numbers.)
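The hard-coded 50 and 150 can be derived from the data instead. A minimal sketch, assuming df2 holds exactly two columns per df1 column (.GT, then .AD) in the same sample order; the toy data are the three samples from the question, and `n` and `idx` are names introduced here for illustration:

```r
# toy data from the question (3 samples, 3 variants)
df1 <- data.frame(Geno1 = c("A", "C", "C"),
                  Geno2 = c("A", "G", "A"),
                  Geno3 = c("A", "C", "G"))
df2 <- data.frame(Geno1.GT = c("0/0", "0/0", "0/1"),
                  Geno1.AD = c("22,3", "2,0", "55,45"),
                  Geno2.GT = c("0/0", "0/1", "0/0"),
                  Geno2.AD = c("33,2", "22,3", "32,2"),
                  Geno3.GT = c("0/0", "1/1", "1/1"),
                  Geno3.AD = c("33,3", "43,33", "22,3"))

n <- ncol(df1)                     # number of samples
stopifnot(ncol(df2) == 2 * n)      # assumption: one GT and one AD column per sample

result.df <- cbind(df1, df2)
# each column of the index matrix is: allele column, its GT column, its AD column
idx <- rbind(seq_len(n),
             n + seq(1, 2 * n, by = 2),
             n + seq(2, 2 * n, by = 2))
# as.vector() reads the matrix column by column, which interleaves the three rows
result.df <- result.df[, as.vector(idx)]
colnames(result.df)
```

Building the index matrix with `rbind()` gives the same interleaving as `t(cbind(...))` without the transpose.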

edit: if you have to perform pattern matching because the columns of the data frames are not in any coordinated order, you can use a similar strategy, but it's a little more complicated. See below - first a toy example with vectors matching your names, then the code applied to data frames.

### Proof of principle using simple vectors
# create allele names
df1 <- paste("Geno", 1:10, sep="")

# create genotype names
df2 <- paste("Geno", rep(1:10, each=2), c(".GT",".AD"), sep="")

# randomly scramble df2
df2 <- df2[sample(length(df2))]

# find the matching genotype and depth for each allele
# (done after scrambling, so the indices refer to the scrambled order)
gt.iv <- match(df1, sub(".GT", "", df2, fixed=TRUE))
ad.iv <- match(df1, sub(".AD", "", df2, fixed=TRUE))

# combine all data
alldata <- c(df1,df2)

# order the combined data
alldata[as.vector(t(cbind(1:length(df1), gt.iv+length(df1),ad.iv+length(df1))))]

### apply the strategy to columns of dataframes called df1 & df2
# create two data frames
# alleles w/fake data
df1 <- data.frame(matrix(sample(c("0/0","1/0","0/1","1/1"),20,replace=TRUE), nrow=2, ncol=10))
colnames(df1) <- paste("Geno", 1:ncol(df1), sep="")
# genotypes w/fake data (2 rows x 20 columns = 40 values)
df2 <- data.frame(matrix(sample(1:100,40), nrow=2, ncol=20))
colnames(df2) <- paste("Geno", rep(1:(ncol(df2)/2), each=2), c(".GT",".AD"), sep="")

# add chaos to df2
df2 <- df2[,sample(1:ncol(df2),ncol(df2))]

# find the matching genotype and depth for each allele
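# fixed=TRUE treats ".GT"/".AD" literally rather than as a regular expression
gt.iv <- match(colnames(df1), sub(".GT", "", colnames(df2), fixed=TRUE))
ad.iv <- match(colnames(df1), sub(".AD", "", colnames(df2), fixed=TRUE))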

# combine all data
alldata <- cbind(df1,df2)

# order the combined data
alldata <- alldata[,as.vector(t(cbind(1:ncol(df1), gt.iv+ncol(df1),ad.iv+ncol(df1))))]
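The asker mentions that df1 has more samples than df2. In that case `match()` returns `NA` for alleles with no genotype columns, and the column indexing above no longer works as intended. A sketch of one way to handle this, assuming the goal is to keep only the samples present in both data frames (the toy data and the names `common`, `wanted` and `result` are introduced here for illustration):

```r
# toy data: df1 has an extra sample (Geno4); df2 columns are in scrambled order
df1 <- data.frame(Geno1 = c("A", "C"), Geno2 = c("A", "G"),
                  Geno3 = c("A", "C"), Geno4 = c("T", "T"))
df2 <- data.frame(Geno3.AD = c("33,3", "43,33"), Geno1.GT = c("0/0", "0/0"),
                  Geno2.AD = c("33,2", "22,3"),  Geno3.GT = c("0/0", "1/1"),
                  Geno1.AD = c("22,3", "2,0"),   Geno2.GT = c("0/0", "0/1"))

# samples (in df1 order) that have both a .GT and an .AD column in df2
common <- colnames(df1)[paste0(colnames(df1), ".GT") %in% colnames(df2) &
                        paste0(colnames(df1), ".AD") %in% colnames(df2)]

# build the wanted column names directly: sample, sample.GT, sample.AD, ...
wanted <- as.vector(rbind(common, paste0(common, ".GT"), paste0(common, ".AD")))

# select by name, so the column order of df2 does not matter
result <- cbind(df1, df2)[, wanted]
colnames(result)
```

Selecting by name rather than by position avoids the index arithmetic, at the cost of assuming the suffixes are exactly `.GT` and `.AD`.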

Sean Davis · Answer 2 · 2015-05-01

One-off solution:

tmpResult = cbind(df1,df2)
result = tmpResult[,order(colnames(tmpResult))]

This assumes the colnames are as given in the post.
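Plain alphabetical ordering puts `.AD` before `.GT` and, with more than nine samples, `Geno10` before `Geno2`. If the exact order shown in the question is needed, here is a sketch that sorts on the sample number and then on an explicit suffix rank, assuming every column name follows the `Geno<number>` pattern with an optional `.GT` or `.AD` suffix (`nm`, `sample.no` and `suffix.rank` are illustrative names):

```r
# column names as they might come out of cbind(df1, df2), here for 12 samples
nm <- c(paste0("Geno", 1:12),
        paste0("Geno", rep(1:12, each = 2), c(".GT", ".AD")))

# numeric sample number extracted from each name
sample.no <- as.integer(sub("^Geno([0-9]+).*$", "\\1", nm))
# 1 = allele column, 2 = .GT, 3 = .AD
suffix.rank <- ifelse(grepl("\\.GT$", nm), 2L,
                      ifelse(grepl("\\.AD$", nm), 3L, 1L))

# sort by sample number first, then by suffix rank within each sample
ord <- order(sample.no, suffix.rank)
nm[ord]
# with a data frame, the same index reorders the columns:
# result <- tmpResult[, ord]   # where nm <- colnames(tmpResult)
```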

Updated 21 months ago by Ram 43k • written 9.0 years ago by Sean Davis 26k