Tutorial: Machine Learning For Cancer Classification - Part 3 - Predicting With A Random Forest Classifier

This tutorial is part of a series illustrating basic concepts and techniques for machine learning in R. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper.

This follows from: Machine learning for cancer classification - part 1 - preparing the data sets and Machine learning for cancer classification - part 2 - Building a Random Forest Classifier. This part covers how to use a pre-computed Random Forest classification model to predict relapse in new breast cancer samples. We assume the data sets are in the format produced in part 1. The data sets in this tutorial are the 'test data' from the prior tutorial, retrieved from GSE2990. To avoid the hassle of copy-pasting every block of code, the full script can be downloaded here.

Load the necessary packages.

library(randomForest)
library(ROCR)
require(Hmisc)
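All three packages are available from CRAN, so if any of them are not yet installed, a one-time install along these lines should work:

install.packages(c("randomForest", "ROCR", "Hmisc"))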

Set the working directory and filenames for input/output.

setwd("/Users/ogriffit/git/biostar-tutorials/MachineLearning")
RF_model_file="RF_model"
datafile="testset_gcrma.txt"
clindatafile="testset_clindetails.txt"
outfile="testset_RFoutput.txt"
case_pred_outfile="testset_CasePredictions.txt"
ROC_pdffile="testset_ROC.pdf"
vote_dist_pdffile="testset_vote_dist.pdf"

Next we will read in the data sets (expecting tab-delimited files with a header line and row names). We also need to reorder the clinical data so that it is in the same order as the GCRMA expression data.

data_import=read.table(datafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_import=read.table(clindatafile, header = TRUE, na.strings = "NA", sep="\t")
clin_data_order=order(clin_data_import[,"geo_accn"])
clindata=clin_data_import[clin_data_order,]
data_order=order(colnames(data_import)[4:length(colnames(data_import))])+3 #Order data without first three columns, then add 3 to get correct index in original file
rawdata=data_import[,c(1:3,data_order)] #grab first three columns, and then remaining columns in order determined above
header=colnames(rawdata)
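As a quick sanity check (not part of the original workflow, and assuming the expression column names are the GEO accessions listed in the clinical data), you can confirm that the two data sets now line up:

all(clindata[,"geo_accn"] == colnames(rawdata)[4:length(colnames(rawdata))]) #should be TRUE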

Extract just the expression values from the filtered data and transpose the matrix.

predictor_data=t(rawdata[,4:length(header)])
predictor_names=as.vector((rawdata[,3])) #gene symbol
colnames(predictor_data)=predictor_names

Load the Random Forest classifier from file (the rf_output object saved previously).

load(file=RF_model_file)

Extract the predictor (gene) names from the RF model built in the previous tutorial and subset the test data to just these predictors. This is not strictly necessary, as the randomForest predict function automatically restricts itself to these variables; it is included here for clarity.

RF_predictor_names=rownames(rf_output$importance)
predictor_data=predictor_data[,RF_predictor_names]
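If a predictor the model expects is absent from the test data, this subsetting will fail with a "subscript out of bounds" error. A defensive check along these lines (a sketch, not part of the original script) can make that failure easier to diagnose:

missing_predictors=setdiff(RF_predictor_names, colnames(predictor_data)) #should be character(0)
if (length(missing_predictors) > 0) stop("Predictors missing from test data: ", paste(missing_predictors, collapse=", "))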

Run the test data through the forest!

RF_predictions_responses=predict(rf_output, predictor_data, type="response")
RF_predictions_votes=predict(rf_output, predictor_data, type="vote")
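To get a feel for what the two calls return (a quick look, not needed for the workflow): type="response" gives a factor of predicted classes, while type="vote" gives a matrix of per-class vote fractions.

table(RF_predictions_responses) #counts of predicted NoRelapse vs Relapse
head(RF_predictions_votes) #one row per sample, one column of vote fractions per class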

Join predictions with clinical data and exclude rows with missing clinical data needed for subsequent steps. Then, write results to file.

clindata_plusRF=cbind(clindata,RF_predictions_responses,RF_predictions_votes)
clindata_plusRF=clindata_plusRF[which(!is.na(clindata_plusRF[,"event.rfs"])),]
write.table(clindata_plusRF,file=case_pred_outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE)

As before, the following lines give an overview of the classifier's performance. This time, instead of estimated performance from out-of-bag (OOB) testing, we are measuring actual performance on an independent test set.

confusion=table(clindata_plusRF[,c("event.rfs","RF_predictions_responses")])
rownames(confusion)=c("NoRelapse","Relapse")
sensitivity=(confusion[2,2]/(confusion[2,2]+confusion[2,1]))*100
specificity=(confusion[1,1]/(confusion[1,1]+confusion[1,2]))*100
overall_error=((confusion[1,2]+confusion[2,1])/sum(confusion))*100
overall_accuracy=((confusion[1,1]+confusion[2,2])/sum(confusion))*100
class1_error=confusion[1,2]/(confusion[1,1]+confusion[1,2])
class2_error=confusion[2,1]/(confusion[2,2]+confusion[2,1])

Next we will prepare each useful statistic for writing to an output file.

sens_out=paste("sensitivity=",sensitivity, sep="")
spec_out=paste("specificity=",specificity, sep="")
err_out=paste("overall error rate=",overall_error,sep="")
acc_out=paste("overall accuracy=",overall_accuracy,sep="")
misclass_1=paste(confusion[1,2], colnames(confusion)[1],"misclassified as", colnames(confusion)[2], sep=" ")
misclass_2=paste(confusion[2,1], colnames(confusion)[2],"misclassified as", colnames(confusion)[1], sep=" ")
confusion_out=confusion[1:2,1:2]
confusion_out=cbind(rownames(confusion_out), confusion_out)

Create variables for the known target class and predicted class probabilities.

target=clindata_plusRF[,"event.rfs"]
target[target==1]="Relapse"
target[target==0]="NoRelapse"
relapse_scores=clindata_plusRF[,"Relapse"]

Once again, we will create an ROC curve and calculate the area under it (AUC). Recall that we use the relapse/non-relapse vote fractions as the predictive variable. The ROC curve is generated by stepping through different thresholds for calling relapse versus non-relapse.

First calculate the AUC value.

pred=prediction(relapse_scores,target)
perf_AUC=performance(pred,"auc")
AUC=perf_AUC@y.values[[1]]
AUC_out=paste("AUC=",AUC,sep="")

Print the results to file along with other stats prepared above.

write("confusion table", file=outfile)
write.table(confusion_out,file=outfile, sep="\t", quote=FALSE, col.names=TRUE, row.names=FALSE, append=TRUE)
write(c(sens_out,spec_out,acc_out,err_out,misclass_1,misclass_2,AUC_out), file=outfile, append=TRUE)

Then, plot the actual ROC curve.

perf_ROC=performance(pred,"tpr","fpr")
pdf(file=ROC_pdffile)
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
dev.off()

You should now have a case predictions file, with which we will perform some survival analysis. See: Machine learning for cancer classification - part 4 - Plotting a Kaplan-Meier Curve for Survival Analysis.
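Before moving on, it can be worth a quick check that the case predictions file was written as expected (a minimal sketch; the Relapse column name is assumed to carry through from the vote matrix in the cbind above):

check=read.table(case_pred_outfile, header=TRUE, sep="\t")
dim(check) #one row per case with complete clinical data
head(check[,c("event.rfs","RF_predictions_responses","Relapse")])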


I would be interested in a Pythonic version of this tutorial using scikit-learn or another library.


Me too. Are you volunteering to create one?


Tutorials are written by experts in the field like you. I tried machine learning a few times on datasets that were too biased or too small, and the chosen features did not describe them properly, so I did not publish any paper on it.


Dear Obi

I am using your tutorial on my test data. Part 2 worked fabulously on my training data set, and I thank you wholeheartedly for that, but now I have been stuck for several days. My code fails at the prediction stage, where I generate my RF predictions. This is the output I get on my dataset:

> header=colnames(test)
> predictor_data=(test[,1:length(header)])
> predictor_names=as.vector((test[,1]))
> colnames(predictor_data)=predictor_names
> load(file=RF_model_file)
> RF_predictions_responses=predict(rf_output, predictor_data, type="response")
Error in eval(expr, envir, enclos) : object 'SubjectTimepoint' not found

I am not sure what is going on. I wrote the predictor_names and predictor_data to a csv file, but I am still very much confused. I am not sure why it is not finding SubjectTimePoint; it is there in my predictor_data. I have SubjectTimePoint (a factor), Relapse (a factor), Window (a factor indicating which month the timepoint is in), and then several genera. I want to predict relapse or no relapse on this new test dataset using my training data set.

I tried looking up the error on StackOverflow, but I am afraid I still do not understand what is going on.

I am new to R and just trying to learn, so please accept my apologies if this comes off awkward or off base. But if you can help me, I would be very grateful.

Respectfully yours

Vijay Antharam

(mst3000jay@gmail.com)


Good evening, I am new to programming in R. Could someone please tell me how to solve this error? Running

GSE2034_clindata=getGSEDataTables("GSE2034")[[2]][1:286,]

gives "Unknown IO error", the warning: I/O warning : failed to load external entity "http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?targ=self&form=xml&view=full&acc=GSE2034", and then: Error in UseMethod("xmlRoot") : no applicable method for 'xmlRoot' applied to an object of class "NULL"


Similar to the previous comments, when I run the chunk:

RF_predictor_names=na.omit(rownames(rf_output$importance)) 
predictor_data=predictor_data[,RF_predictor_names] 

I get the "Error in predictor_data[, RF_predictor_names] : subscript out of bounds"

and the subsequent code:

RF_predictions_responses=predict(rf_output, predictor_data, type="response") 

gives

"Error in predict.randomForest(rf_output, predictor_data, type = "response") :
variables in the training data missing in newdata".

Has anybody encountered and solved these problems?

Gene_MMP8 • 8.7 years ago

While trying to run the random forest classifier I am getting this error. Please help!

> rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 25001, proximity=TRUE, sampsize=sampsizes)
Error in randomForest.default(x = predictor_data, y = target, importance = TRUE,  : 
  length of response must be the same as predictors

FYI

> dim(predictor_data)
[1] 283  52

> length(target)
[1] 286

The lengths are different, but I wonder why? I followed each and every step of this tutorial.

Please help! It's kind of urgent!


This is hard to guess. I just tried the commands in the first three tutorials and they worked fine. It looks like you are trying to modify the script for your own data. That is great. Are you expecting 283 cases or 286? This is likely an issue with how you have set up your data matrix and target object, and is not specific to this tutorial. Carefully examine your predictor_data matrix. Does it have the expected number of rows (cases/patients) and columns (predictors/genes)? Ask the same question of your target vector.
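A minimal set of checks along these lines (using the object names from the question above) might be:

dim(predictor_data) #rows should be cases, columns should be predictors
length(target) #must equal nrow(predictor_data)
sum(is.na(target)) #NAs here can silently drop cases upstream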

David_emir • 8.1 years ago

Hi,

While running the test data through the forest:

> RF_predictions_responses=predict(rf_output, predictor_data, type="response")

I am getting the following error:

"variables in the training data missing in newdata"

I did check, and it looks fine to me. Kindly help. I am using the same data set that is used in this tutorial.

Thanks a lot
Dave.


Hi Dave, check the column names. If the column names in the original and the new data do not agree, this error is thrown.

Try:

 identical(names(original_data),names(new_data))

If this returns FALSE, the column names do not match. Make sure the column names in the original data and the new data are the same. Try it and let me know.
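If it returns FALSE, a quick way to see exactly which names differ (same hypothetical object names as above):

setdiff(names(original_data), names(new_data)) #in the original but not the new data
setdiff(names(new_data), names(original_data)) #in the new but not the original data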


Hi, I had the same problem as you. I used the processAffyTestData.R script from the first part, and it seems that the problem was this part of the script:

gcrma=gcrma[1:12030,]

when I replaced it with

gcrma=gcrma[which(!grepl("AFFX", rownames(gcrma))),]

the problem disappeared.


This happens because the custom CDF files change over time, so the hardcoded row range used to eliminate the AFFX (control) probes becomes incorrect, and it is then possible that some of these get included in the model. We added the alternative way to remove AFFX probes in some places but missed others, which likely created the disconnect where an AFFX probe got included in the trained model but wasn't available in the test data. I've tried to make sure this is consistent everywhere now.
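For anyone hitting this, a one-line sketch to confirm whether any control probes slipped into a trained model (using the rf_output object from part 2):

grep("AFFX", rownames(rf_output$importance), value=TRUE) #ideally returns character(0)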


Hi David,

I get the same error. I think the number of predictor variables used to build the model is 1246 (after filtering), but the number of predictor variables in the test data (i.e., predictor_data) is 12030. Those numbers are different. I suggest filtering so that you keep only the model's predictor variables in predictor_data.


Hi, I've solved the problem by changing the randomForest call (in part 2) in this way:

rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes, na.action = na.omit)

After this little adjustment, the following instruction works without errors:

RF_predictions_responses=predict(rf_output, predictor_data, type="response")   # Works

A few lines earlier, another little change was necessary for me, to fix an "out of bounds" error in the subsetting of predictor_data:

RF_predictor_names=na.omit(rownames(rf_output$importance)) #--> the NA values need to be avoided
predictor_data=predictor_data[,RF_predictor_names] # Now the SUBSETTING Works

And finally, I had to modify this part of the code (because clindata had 189 rows while RF_predictions_responses and RF_predictions_votes had 286 rows):

clindata_plusRF=cbind(clindata, RF_predictions_responses=RF_predictions_responses[1:nrow(clindata)],RF_predictions_votes=RF_predictions_votes[1:nrow(clindata),])
clindata_plusRF=clindata_plusRF[which(!is.na(clindata_plusRF[,"event.rfs"])),] # no modifications needed, this line is OK
colnames(clindata_plusRF)[15]="NoRelapse" # added to set the column name
colnames(clindata_plusRF)[16]="Relapse" # added to set the column name

Now the code works on my computer.

debitboro • 7.0 years ago

Hi guys,

Is this script written by experts? I don't understand: you have constructed your model based on filtered predictor variables of length 1260, and after that you want to test it on new cases described by 12030 predictor variables!


@debitboro - welcome to Biostars. The predict function will only use predictor variables that exist in the (previously trained) model that is provided. Running most recently, I get 1247 predictors after filtering the training data down from 12264 total predictors. The RF model is trained on, and only knows about, those 1247 predictors. The test dataset also starts with 12264 total predictors (note: these numbers can change depending on the version of the custom CDF you use, and those CDFs change as gene annotations evolve). When we supply the pre-trained RF model and the new test data to the predict function, it only considers the 1247 predictors in the RF model. If one of the necessary predictors were missing from the test data, you would get an error. But it doesn't care about extra predictors; they are simply ignored. In this case, since we created the train and test datasets using the exact same method, we can expect all the necessary predictors to be there. It is not necessary, but we could make this explicit by adding something like:

RF_predictor_names=rownames(rf_output$importance)
predictor_data=predictor_data[,RF_predictor_names]

This would reduce the predictor_data variable to just those genes in the RF model. But as explained above, it is not required since this happens automatically.

As an aside, it is poor etiquette to call into question the expertise of forum contributors. Questions and potential problems are of course welcome. But, the content provided here is not guaranteed to meet your expectations and the people who contribute it do so on a volunteer basis to help others, despite having many other obligations on their time and expertise.
