R programming question: genotyping data table manipulation2
14 months ago
MAPK ♦ 1.4k
United States

Hi Guys,

I have two data frames, df1 and df2: df1 holds the alleles and df2 the genotypes. There are more than 50 samples (1:50), and df2 consists of genotype and depth-of-coverage columns interleaved one after another: Geno1.GT, Geno1.AD, Geno2.GT, Geno2.AD, ..., Geno50.GT, Geno50.AD. How do I match columns on Geno plus the sample number only (i.e. ignoring the .AD or .GT extensions), so that, e.g., columns Geno1, Geno1.GT and Geno1.AD end up together, and get the result table below? Thank you.

df1

Geno1 Geno2 Geno3
A A A
C G C
C A G

df2

Geno1.GT Geno1.AD Geno2.GT Geno2.AD Geno3.GT Geno3.AD
0/0 22,3 0/0 33,2 0/0 33,3
0/0 2,0 0/1 22,3 1/1 43,33
0/1 55,45 0/0 32,2 1/1 22,3

Result

Geno1 Geno1.GT Geno1.AD Geno2 Geno2.GT Geno2.AD Geno3 Geno3.GT Geno3.AD
A 0/0 22,3 A 0/0 33,2 A 0/0 33,3
C 0/0 2,0 G 0/1 22,3 C 1/1 43,33
C 0/1 55,45 A 0/0 32,2 G 1/1 22,3

Are the rows of df1 and df2 matching and in the same order?


Thank you Sean for your reply. There are more column names in df1 (Geno1:Geno100 or more). All the columns in df2 are present in df1, but not the other way around. df1 has more samples than df2, and the order is also different.


Sean asked about the rows. If the rows are in the same order, merging is going to be much easier.


Sorry, yes, the rows are in the same order and equal in number.

15 months ago
seidel 6.8k
United States

Assuming things are in order and you don't have to do any actual pattern matching, the answer could be as simple as:

# bind them together to form one dataframe
result.df <- cbind(df1,df2)
# rearrange the columns using a pattern of numbers
result.df <- result.df[,as.vector(t(cbind(1:50,seq(51,150,2),seq(52,150,2))))]

Because the columns of the two data frames follow a regular pattern, you can write that pattern as sequences of column numbers and interleave them to arrange the columns as you wish. (This also assumes your first data frame has 50 columns and the second has twice as many; the code could easily be modified to use the actual numbers.)
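
As a sketch of how the hard-coded 50 and 150 could be replaced by the actual column counts, the following uses the three-sample example from the question. It assumes, as above, that df2 holds exactly one .GT and one .AD column per df1 column, in the same sample order; the names n and idx are illustrative only.

```r
# Example data from the question
df1 <- data.frame(Geno1 = c("A", "C", "C"),
                  Geno2 = c("A", "G", "A"),
                  Geno3 = c("A", "C", "G"))
df2 <- data.frame(Geno1.GT = c("0/0", "0/0", "0/1"),
                  Geno1.AD = c("22,3", "2,0", "55,45"),
                  Geno2.GT = c("0/0", "0/1", "0/0"),
                  Geno2.AD = c("33,2", "22,3", "32,2"),
                  Geno3.GT = c("0/0", "1/1", "1/1"),
                  Geno3.AD = c("33,3", "43,33", "22,3"))

n <- ncol(df1)                  # number of samples
stopifnot(ncol(df2) == 2 * n)   # assumption: one GT and one AD column per sample

result.df <- cbind(df1, df2)

# rbind() stacks the three index vectors as rows; as.vector() then reads the
# matrix column by column, giving 1, n+1, n+2, 2, n+3, n+4, ...
idx <- as.vector(rbind(seq_len(n),
                       n + seq(1, 2 * n, by = 2),   # .GT columns
                       n + seq(2, 2 * n, by = 2)))  # .AD columns
result.df <- result.df[, idx]
colnames(result.df)
```

Using rbind() here is equivalent to the t(cbind(...)) in the original line; it just skips the transpose.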

edit: if you have to perform pattern matching because the columns of the data frames are not in any coordinated order, you can use a similar strategy, but it's a little more complicated. See below - first a toy example with vectors matching your names, then the code applied to data frames.

### Proof of principle using simple vectors
# create allele names
df1 <- paste("Geno", 1:10, sep="")

# create genotype names
df2 <- paste("Geno", rep(1:10, each=2), c(".GT",".AD"), sep="")

# randomly scramble df2 (before matching, so the indices refer to the scrambled order)
df2 <- df2[sample(1:length(df2),length(df2))]

# find the matching genotype and depth for each allele
# (fixed=TRUE so that "." is matched literally, not as a regex wildcard)
gt.iv <- match(df1,sub(".GT","",df2,fixed=TRUE))
ad.iv <- match(df1,sub(".AD","",df2,fixed=TRUE))

# combine all data
alldata <- c(df1,df2)

# order the combined data
alldata[as.vector(t(cbind(1:length(df1), gt.iv+length(df1),ad.iv+length(df1))))]

### apply the strategy to columns of dataframes called df1 & df2
# create two data frames
# alleles w/fake data
df1 <- data.frame(matrix(sample(c("0/0","1/0","0/1","1/1"),20,replace=TRUE), nrow=2, ncol=10))
colnames(df1) <- paste("Geno", 1:ncol(df1), sep="")
# genotypes w/fake data
df2 <- data.frame(matrix(sample(1:100,20), nrow=2, ncol=20))
colnames(df2) <- paste("Geno", rep(1:(ncol(df2)/2), each=2), c(".GT",".AD"), sep="")

# add chaos to df2
df2 <- df2[,sample(1:ncol(df2),ncol(df2))]

# find the matching genotype and depth for each allele
gt.iv <- match(colnames(df1),sub(".GT","",colnames(df2),fixed=TRUE))
ad.iv <- match(colnames(df1),sub(".AD","",colnames(df2),fixed=TRUE))

# combine all data
alldata <- cbind(df1,df2)

# order the combined data
alldata <- alldata[,as.vector(t(cbind(1:ncol(df1), gt.iv+ncol(df1),ad.iv+ncol(df1))))]

Sorry, this doesn't match the column names in df1 and reorder the columns accordingly. The ordering is not correct. Thank you.


I edited my answer to include pattern matching, so the column names can be in arbitrary order between the matrices.


Thank you so much, Seidel, but the last line generates an error with my data frame:

> alldata <- alldata[,as.vector(t(cbind(1:ncol(cpseq), gt.iv+ncol(cpseq),ad.iv+ncol(cpseq))))]

Error in `[.data.frame`(alldata, , as.vector(t(cbind(1:ncol(cpseq),  : 
  undefined columns selected

I'm not sure what to tell you except that you'll have to trouble-shoot and see what's going on. I edited the example to contain two fake dataframes which resemble the ones above (except my fake allele depth data - I didn't want to think that hard how to generate columns of paired, comma delimited numbers), and the code executes on copy and paste without errors. Check your intermediate steps with your actual data.
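
One possible cause of "undefined columns selected", assuming (as described in the comments on the question) that df1 contains samples with no columns in df2: match() returns NA for those samples, and NA column indices cannot be selected from a data frame. The sketch below uses made-up data and illustrative names (keep, idx), and drops the unmatched samples before interleaving; it is not a diagnosis of the actual cpseq data.

```r
# Made-up data: df1 has four samples, df2 covers only three of them, out of order
df1 <- data.frame(Geno1 = c("A", "C"), Geno2 = c("A", "G"),
                  Geno3 = c("A", "C"), Geno4 = c("T", "T"))
df2 <- data.frame(Geno3.AD = c("33,3", "43,33"), Geno1.GT = c("0/0", "0/1"),
                  Geno2.GT = c("0/0", "0/1"),    Geno3.GT = c("0/0", "1/1"),
                  Geno1.AD = c("22,3", "2,0"),   Geno2.AD = c("33,2", "22,3"))

# fixed = TRUE treats ".GT" / ".AD" as literal text rather than a regular expression
gt.iv <- match(colnames(df1), sub(".GT", "", colnames(df2), fixed = TRUE))
ad.iv <- match(colnames(df1), sub(".AD", "", colnames(df2), fixed = TRUE))

# Geno4 has no partner in df2, so both index vectors contain NA for it
keep <- !is.na(gt.iv) & !is.na(ad.iv)

alldata <- cbind(df1, df2)

# interleave: allele column, its .GT column, its .AD column, matched samples only
idx <- as.vector(rbind(which(keep),
                       gt.iv[keep] + ncol(df1),
                       ad.iv[keep] + ncol(df1)))
alldata <- alldata[, idx]
colnames(alldata)
```

Printing gt.iv and ad.iv on the real data and checking them with anyNA() is a quick way to see whether this is what is happening.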

15 months ago
National Institutes of Health, Bethesda…

One-off solution:

tmpResult = cbind(df1,df2)
result = tmpResult[,order(colnames(tmpResult))]

This assumes the colnames are as given in the post.
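
Note that ordering by name is alphabetical, so with more than nine samples Geno10 sorts before Geno2, and within a sample .AD sorts before .GT; string collation also depends on the locale. A sketch of an explicit sort key that gives numeric sample order and the Geno, .GT, .AD layout from the question, assuming every column is named Geno<number> with an optional .GT or .AD suffix (the names nms, sample.num, suffix.rank and sorted are illustrative):

```r
# Made-up column names covering more than nine samples, deliberately scrambled
nms <- c("Geno10.AD", "Geno2", "Geno10", "Geno2.GT", "Geno1.AD",
         "Geno1", "Geno10.GT", "Geno2.AD", "Geno1.GT")

# Numeric sample id: keep only the digits that follow the "Geno" prefix
sample.num <- as.integer(sub("^Geno([0-9]+).*$", "\\1", nms))

# Rank within a sample: plain allele column first, then .GT, then .AD
suffix.rank <- ifelse(grepl("\\.GT$", nms), 2L,
               ifelse(grepl("\\.AD$", nms), 3L, 1L))

# order() sorts by sample number first and breaks ties with the suffix rank
sorted <- nms[order(sample.num, suffix.rank)]
sorted
```

Applied to the combined data frame, the same keys would be computed from colnames(tmpResult) and used as tmpResult[, order(sample.num, suffix.rank)].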
