This package provides a simple interface for cross validation and hyper-parameter tuning of statistical/machine learning models. Currently, I’m still designing the specification of the APIs and objects, so they may change in future versions. If you find bugs or have any requests, please let me know via GitHub or email; paste the following code into R to get my address: rawToChar(as.raw(c(109, 111, 103, 64, 102, 102, 112, 114, 105, 46, 97, 102, 102, 114, 99, 46, 103, 111, 46, 106, 112))).
To install the package, copy & paste the following script into the R console. Currently, it may produce many warnings related to the compilation of documents, but these do not affect any functionality of the package.
install.packages(
    c("R6", "model.adapter", "cv.models"), type = "source",
    repos = c("http://florivory.net/R/repos", options()$repos)
)
The cv.models function evaluates the predictive ability of a model. The most basic usage of cv.models is to put a call to a statistical/machine learning modeling function directly into cv.models, as follows.
library(cv.models)
data(iris)
cv <- cv.models(glm(Petal.Length ~ ., data = iris))
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.07160118 0.2637418 0.9773273 0.9338794 0.8322722 0.9740298 0.0257532
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.04762617 0.008654745 0.05044482 0.07480678 0.01150527
In the call of the statistical/machine learning model, please use the data argument to specify the dataset, as in glm(Petal.Length ~ Petal.Width, data = iris). Currently, model calls using the $ operator, such as glm(iris$Petal.Length ~ iris$Petal.Width), do not work. If you need the latter style, please contact me.
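For example (the second call is shown only to illustrate the unsupported style):
# Supported: specify the dataset with the data argument.
cv <- cv.models(glm(Petal.Length ~ Petal.Width, data = iris))
# Not supported: referencing columns with the $ operator.
# cv <- cv.models(glm(iris$Petal.Length ~ iris$Petal.Width))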
Alternatively, users can run cv.models by passing a call object for a statistical/machine learning model.
# Using substitute() function
call.glm <- substitute(glm(Petal.Length ~ ., data = iris))
cv <- cv.models(call.glm)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.0720893 0.2642072 0.9786271 0.95254 0.855111 0.9758841 0.02458692
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.05037491 0.006491544 0.02646257 0.06361177 0.007255751
# Using call() function
call.glm <- call("glm", Petal.Length ~ ., data = iris)
cv <- cv.models(call.glm)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.07225901 0.2609248 0.9764348 0.9429454 0.8348614 0.9741313 0.03779596
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.06812785 0.01431566 0.01652288 0.04044752 0.01500444
Model objects themselves can also be used for cross validation. Note that only model objects that keep the original call, such as those of glm or gbm, can be used with cv.models.
model <- glm(Petal.Length ~ ., data = iris)
cv <- cv.models(model)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.07144265 0.2634059 0.9786586 0.9402786 0.8399003 0.9752645 0.02485682
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.04784199 0.008179856 0.02563387 0.04283829 0.01095734
cv.models automatically calculates appropriate model evaluation metrics by detecting the type of the model (i.e., regression or classification). Currently, cv.models calculates the following metrics.
For regression models:
Name of metrics | Column | Definition/Explanation |
---|---|---|
Mean squared error (MSE) | “mse” | \(MSE = mean((prediction - response) ^ 2)\) |
Root mean squared error (RMSE) | “rmse” | \(RMSE = sqrt(mean((prediction - response) ^ 2))\) |
R squared | “r.squared” | \(R ^ 2 = cor(prediction, response) ^ 2\) This is calculated only for reference purposes. I don’t recommend this metric because the intercept and slope are effectively adjusted during the calculation. For example, an ideal model produces predictions on the line y = x when actual and predicted values are plotted. However, when the plotted points of actual and predicted values lie completely on the line y = -2x + 1 (in other words, a completely wrong prediction model), the value of R squared still becomes 1. |
Spearman’s ρ | “spearman” | \(\rho = cor(prediction, response, method = "spearman")\) |
Kendall’s τ | “kendall” | \(\tau = cor(prediction, response, method = "kendall")\) |
Q squared | “q.squared” | \(Q ^ 2 = 1 - \sum((prediction - response) ^ 2) / \sum((response - mean(response)) ^ 2)\), see Consonni et al. (2009). Currently, the mean value of each fold is used for \(mean(response)\) by default, whereas the mean of all data is used when aggregate.method = "join" (see below) is specified. I will consider a better default to improve consistency. |
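As a reference, these definitions can be computed directly in R. The following sketch illustrates the formulas with in-sample glm predictions; note that cv.models itself computes them per fold on held-out data.
# Illustration of the regression metrics defined above, using in-sample
# predictions (cv.models computes these on held-out folds).
response <- iris$Petal.Length
prediction <- predict(glm(Petal.Length ~ ., data = iris))
mse <- mean((prediction - response)^2)
rmse <- sqrt(mse)
r.squared <- cor(prediction, response)^2
spearman <- cor(prediction, response, method = "spearman")
kendall <- cor(prediction, response, method = "kendall")
q.squared <- 1 - sum((prediction - response)^2) / sum((response - mean(response))^2)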
For classification models:
Name of metrics | Column | Definition/Explanation |
---|---|---|
Optimal threshold | “threshold” | Optimal threshold separating presence/absence. By default, determined by Youden’s J. |
Specificity | “specificity” | See Wikipedia |
Sensitivity | “sensitivity” | |
Accuracy | “accuracy” | |
True negative count (TN) | “tn” | |
True positive count (TP) | “tp” | |
False negative count (FN) | “fn” | |
False positive count (FP) | “fp” | |
Negative predictive value (NPV) | “npv” | |
Positive predictive value (PPV) | “ppv” | |
Diagnostic likelihood ratio (DLR)+ | “dlr.positive” | \(DLR.positive = Sensitivity / (1 - Specificity)\), see Lopez-Raton et al. (2014) |
Diagnostic likelihood ratio (DLR)- | “dlr.negative” | \(DLR.negative = (1 - Sensitivity) / Specificity\), see Lopez-Raton et al. (2014) |
Matthews correlation coefficient (MCC) | “mcc” | |
Informedness | “informedness” | \(Informedness = Sensitivity + Specificity - 1\) |
Markedness | “markedness” | \(Markedness = PPV + NPV - 1\) |
Log-likelihood | “loglik” | \(LogLik = \sum(\log(p \cdot y + (1 - p) \cdot (1 - y)))\), where \(p\) is the predicted probability and \(y\) is the observed response (0 or 1), see Lawson et al. (2014) |
Likelihood based R squared | “rsq.loglik” | See Lawson et al. (2014) |
Cohen’s Kappa | “kappa” | See Cohen (1960) |
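Many of the threshold-based metrics above can be derived from the confusion-matrix counts. A minimal sketch with hypothetical counts:
# Hypothetical confusion-matrix counts, used only to illustrate the formulas.
tp <- 45; tn <- 90; fp <- 10; fn <- 5
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv <- tp / (tp + fp)
npv <- tn / (tn + fn)
informedness <- sensitivity + specificity - 1
markedness <- ppv + npv - 1
mcc <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))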
Some models, such as gbm, require additional argument(s) for the predict method which can affect predictive ability. To use such models with cv.models, specify these as additional arguments of cv.models.
library(gbm)
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1097684 0.3217449 0.966524 0.910217 0.8059406 0.9595215 0.05740915
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.08332443 0.02538766 0.102032 0.1277251 0.02270992
By default, cv.models determines the positive class of the response variable by the following rules.
- When the response variable is numeric, cv.models treats 1 as the positive class.
- When the response variable is logical, cv.models treats TRUE as the positive class.
- When the response variable is a factor, cv.models treats the first level (levels(response)[1]) as the positive class.
- When the response variable is character, cv.models treats the first unique value (unique(response)[1]) as the positive class.
When using a factor or character as the response variable of the model, you can use the positive.class argument to control which class is used as the positive class. Classes that are not treated as the positive class are treated as the negative class. In the following example, cv.models treats versicolor in the iris data as the positive class, and setosa and virginica are treated as the negative class.
cv <- cv.models(
    gbm(Species ~ ., data = iris, n.trees = 50, distribution = "multinomial"),
    n.cores = 1, n.trees = 50, positive.class = "versicolor"
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## method threshold sensitivity specificity ppv npv dlr.positive
## 1 Youden 0.6129907 1 0.9725758 0.9383333 1 Inf
## dlr.negative fp fn auc tp tn accuracy informedness markedness
## 1 0 0.3 0 0.9901717 5 9.7 0.98 0.9725758 0.9383333
## mcc loglik rsq.loglik kappa sd.threshold sd.sensitivity
## 1 0.9547984 -2.467852 0.6411764 0.9514063 0.4612871 0
## sd.specificity sd.ppv sd.npv sd.dlr.positive sd.dlr.negative
## 1 0.0443321 0.1012423 0 NaN 0
## sd.fp sd.fn sd.auc sd.tp sd.tn sd.accuracy sd.informedness
## 1 0.4830459 0 0.01836965 1.563472 1.418136 0.03220306 0.0443321
## sd.markedness sd.mcc sd.loglik sd.rsq.loglik sd.kappa
## 1 0.1012423 0.07331211 2.515344 0.2936008 0.07889443
By default, cv.models calculates performance metrics for each fold and then averages them. However, this method can't work for Leave-One-Out cross validation (LOOCV) or for data with class imbalance in the response variable. I'm not sure what the best way of handling this situation is, but currently, by specifying aggregate.method = "join", cv.models calculates the metrics after joining the results of all folds.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    aggregate.method = "join", n.trees = 50
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1113121 0.3336346 0.9647023 0.9418911 0.796786 0.9640407 NA
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 NA NA NA NA NA
In this case, the SD of each metric becomes NA.
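For example, LOOCV itself could be sketched by combining aggregate.method = "join" with the folds argument (described below); this is a sketch under the assumption that folds accepts the number of observations:
# A sketch of LOOCV, assuming folds can be set to the number of observations.
cv <- cv.models(
    glm(Petal.Length ~ ., data = iris),
    folds = nrow(iris), aggregate.method = "join"
)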
When class imbalance exists in the response variable, randomly dividing the dataset into folds can create fold(s) that lack observations of a specific class and can disrupt cross validation. Class stratification (a method that divides the dataset into folds while keeping the frequency of classes) can help in this situation.
To use class stratification, specify stratify = TRUE in cv.models.
library(randomForest)
# Create dataset with small frequency of setosa.
test.data <- subset(iris, Species != "setosa")
test.data <- rbind(test.data, head(subset(iris, Species == "setosa"), 10))
# Cross validation.
cv <- cv.models(
    randomForest(Species ~ ., data = test.data),
    aggregate.method = "join", stratify = TRUE
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## method threshold sensitivity specificity ppv npv dlr.positive
## 1 Youden 0.94 1 1 1 1 Inf
## dlr.negative fp fn auc tp tn accuracy informedness markedness mcc
## 1 0 0 0 1 50 100 1 1 1 1
## loglik rsq.loglik kappa sd.threshold sd.sensitivity sd.specificity
## 1 -0.2362728 0.9871859 1 NA NA NA
## sd.ppv sd.npv sd.dlr.positive sd.dlr.negative sd.fp sd.fn sd.auc sd.tp
## 1 NA NA NA NA NA NA NA NA
## sd.tn sd.accuracy sd.informedness sd.markedness sd.mcc sd.loglik
## 1 NA NA NA NA NA NA
## sd.rsq.loglik sd.kappa
## 1 NA NA
Currently, class stratification is implemented only for classification models.
Previous versions of cv.models used Youden's J, via the coords function in the pROC package, to find the optimal threshold dividing the positive and negative classes. After ver. 0.1.0, cv.models uses the optimal.cutpoints function in the OptimalCutpoints package, which allows other metrics to be used for determining the optimal threshold. By default, cv.models still uses Youden's J to determine the threshold, but users can pass a list of options for optimal.cutpoints via the cutpoint.options argument to control the method used for threshold determination. In the following example, sensitivity, instead of Youden's J, is maximized by specifying list(methods = "MaxSe") for cutpoint.options.
cv <- cv.models(
    gbm(Species ~ ., data = iris, n.trees = 50, distribution = "multinomial"),
    n.cores = 1, n.trees = 50, positive.class = "setosa",
    cutpoint.options = list(methods = "MaxSe")
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## method threshold sensitivity specificity ppv npv dlr.positive
## 1 MaxSe 0.9958904 1 1 1 1 Inf
## dlr.negative fp fn auc tp tn accuracy informedness markedness mcc
## 1 0 0 0 1 5 10 1 1 1 1
## loglik rsq.loglik kappa sd.threshold sd.sensitivity sd.specificity
## 1 -0.02085396 0.8966806 1 0.001572169 0 0
## sd.ppv sd.npv sd.dlr.positive sd.dlr.negative sd.fp sd.fn sd.auc
## 1 0 0 NaN 0 0 0 0
## sd.tp sd.tn sd.accuracy sd.informedness sd.markedness sd.mcc
## 1 1.763834 1.763834 0 0 0 0
## sd.loglik sd.rsq.loglik sd.kappa
## 1 0.006152772 0.01801658 0
optimal.cutpoints can determine multiple thresholds simultaneously. For example, by specifying list(methods = c("MaxSe", "MaxSp")) for cutpoint.options, cv.models calculates two sets of metrics, with the thresholds determined by maximizing sensitivity and specificity respectively, yet in this example they bring the same results… :).
cv <- cv.models(
    gbm(Species ~ ., data = iris, n.trees = 50, distribution = "multinomial"),
    n.cores = 1, n.trees = 50, positive.class = "setosa",
    cutpoint.options = list(methods = c("MaxSe", "MaxSp"))
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## method threshold sensitivity specificity ppv npv dlr.positive
## 1 MaxSe 0.9959352 1 1 1 1 Inf
## 2 MaxSp 0.9959352 1 1 1 1 Inf
## dlr.negative fp fn auc tp tn accuracy informedness markedness mcc
## 1 0 0 0 1 5 10 1 1 1 1
## 2 0 0 0 1 5 10 1 1 1 1
## loglik rsq.loglik kappa sd.threshold sd.sensitivity sd.specificity
## 1 -0.02285701 0.8887343 1 0.001461932 0 0
## 2 -0.02285701 0.8887343 1 0.001461932 0 0
## sd.ppv sd.npv sd.dlr.positive sd.dlr.negative sd.fp sd.fn sd.auc
## 1 0 0 NaN 0 0 0 0
## 2 0 0 NaN 0 0 0 0
## sd.tp sd.tn sd.accuracy sd.informedness sd.markedness sd.mcc
## 1 2.788867 2.788867 0 0 0 0
## 2 2.788867 2.788867 0 0 0 0
## sd.loglik sd.rsq.loglik sd.kappa
## 1 0.008369546 0.02159467 0
## 2 0.008369546 0.02159467 0
Users can specify the methods, op.prev and control options of optimal.cutpoints in the list passed to cutpoint.options; other options are ignored. For the details of optimal.cutpoints, see its manual and reference (Lopez-Raton et al. 2014).
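For instance, the following sketch passes the control option built with control.cutpoints(), the option constructor of OptimalCutpoints; the MaxKappa method and the weighted.Kappa setting are taken from the OptimalCutpoints manual, so please check there for valid combinations:
library(OptimalCutpoints)
cv <- cv.models(
    gbm(Species ~ ., data = iris, n.trees = 50, distribution = "multinomial"),
    n.cores = 1, n.trees = 50, positive.class = "setosa",
    cutpoint.options = list(
        methods = "MaxKappa",
        control = control.cutpoints(weighted.Kappa = TRUE)
    )
)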
To change the number of folds, use the folds argument.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50, folds = 5
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1051745 0.3216322 0.9663009 0.9387993 0.8253669 0.9642212 0.0293904
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.04646493 0.01193624 0.03115216 0.044694 0.01079232
Because cross validation and some modeling methods involve stochastic processes, cv.models returns slightly different results every time. To fix the result, specify a number for the seed argument. When seed is set, cv.models returns the same result regardless of parallel computation.
# Results are different each time.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.09922069 0.3105432 0.9691495 0.9298997 0.8242038 0.9664689 0.03640097
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.05561364 0.01098273 0.04932323 0.08184355 0.01029104
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1082467 0.3173534 0.9674564 0.932942 0.8336295 0.9630571 0.0624551
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.09149091 0.01911328 0.03345093 0.06072541 0.02018797
# If seed is set, the same result is returned every time.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50, seed = 12345
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1059846 0.3208402 0.9668684 0.9323122 0.8262999 0.961941 0.04231908
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.05817747 0.01486364 0.04670806 0.06900898 0.01723832
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50, seed = 12345
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1059846 0.3208402 0.9668684 0.9323122 0.8262999 0.961941 0.04231908
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.05817747 0.01486364 0.04670806 0.06900898 0.01723832
By default, cv.models randomly separates the data into several folds and runs cross validation. However, when the observations are not independent of each other (e.g., when the dataset has some internal structure such as spatial autocorrelation), cross validation with random separation can produce optimistic estimates of model performance (Roberts et al. 2017). In this situation, blocked cross validation can produce better results (Roberts et al. 2017), and groupings produced by other packages such as the blockCV package (Valavi et al. 2018) can be used for cross validation. To use a user-defined grouping rather than the default random separation, pass a vector representing the grouping of observations to the group argument of cv.models. A character, factor, integer or logical vector with the same length as the number of observations can be used for the group argument.
The following example tests the predictive ability of the model for different species using the iris data.
cv <- cv.models(
    randomForest(Petal.Length ~ ., data = iris),
    n.cores = 1, group = iris$Species
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 4.19397 1.749769 0.1980847 0.3585859 0.2723849 -106.9464 4.610268
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 1.303234 0.3233384 0.3140936 0.2351445 177.1127
By default, cv.models uses all logical/physical cores. To control the number of cores (i.e., processes) used for cross validation, specify a number for the n.cores argument.
# Cross validation with 2 cores.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50, n.cores = 2
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## mse rmse r.squared spearman kendall q.squared sd.mse
## 1 0.1089966 0.323703 0.9687782 0.9363097 0.8391552 0.962467 0.04566111
## sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.06841881 0.009601633 0.04613624 0.08131339 0.01444266
cv.models can conduct a grid search to find the combination of hyper-parameters that produces the model(s) with the best performance, as judged by cross validation. To conduct a grid search over hyper-parameters, supply a named list of vectors of candidate parameters to the grid argument. In the following example, cv.models calculates performance measures with cross validation for all combinations of the hyper-parameters of gbm (in this example, 1 and 5 for interaction.depth and 1 and 10 for n.minobsinnode).
# See the effect of hyper-parameters on the predictive ability of the model.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    n.trees = 50,
    grid = list(interaction.depth = c(1, 5), n.minobsinnode = c(1, 10))
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## interaction.depth n.minobsinnode mse rmse r.squared
## 1 1 1 0.07970788 0.2778357 0.9735727
## 2 5 1 0.07156345 0.2621945 0.9763217
## 3 1 10 0.10164681 0.3121165 0.9688425
## 4 5 10 0.09726478 0.3067847 0.9671420
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9378012 0.8526449 0.9704077 0.03137318 0.05286475 0.01697545
## 2 0.9359555 0.8339149 0.9725090 0.03094536 0.05595131 0.01293670
## 3 0.9264031 0.8262655 0.9636375 0.04557239 0.06855731 0.01499536
## 4 0.9033653 0.7841828 0.9643673 0.03723326 0.05914151 0.01497294
## sd.spearman sd.kendall sd.q.squared
## 1 0.03324120 0.05584728 0.01625107
## 2 0.03288226 0.05709310 0.01505707
## 3 0.07819431 0.09631725 0.01763200
## 4 0.08094862 0.09215561 0.01479924
For models having extra argument(s) of the predict method that affect predictive ability (e.g., the n.trees argument of gbm), users can specify a list of candidate parameter(s) for grid.predict to conduct a grid search over such parameter(s).
# See the effect of hyper-parameters on the predictive ability of the model.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    grid = list(interaction.depth = c(1, 5), n.minobsinnode = c(1, 10)),
    grid.predict = list(n.trees = c(10, 50, 80))
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## interaction.depth n.minobsinnode n.trees mse rmse r.squared
## 1 1 1 10 0.70940672 0.8347681 0.9276824
## 2 1 1 50 0.07255751 0.2605845 0.9774829
## 3 1 1 80 0.06828483 0.2533573 0.9779486
## 4 5 1 10 0.45822331 0.6706116 0.9780200
## 5 5 1 50 0.07100390 0.2640894 0.9777949
## 6 5 1 80 0.07503506 0.2720042 0.9763532
## 7 1 10 10 0.72188918 0.8345397 0.9284288
## 8 1 10 50 0.10160226 0.3095017 0.9678100
## 9 1 10 80 0.09462196 0.2995565 0.9688744
## 10 5 10 10 0.50339378 0.7040647 0.9672351
## 11 5 10 50 0.09227950 0.2948742 0.9733669
## 12 5 10 80 0.09235612 0.2952405 0.9735409
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9290615 0.8471718 0.7569029 0.20330561 0.11817559 0.030193683
## 2 0.9408784 0.8488161 0.9744224 0.03665937 0.07190453 0.013515508
## 3 0.9365843 0.8411318 0.9757082 0.03562060 0.06745296 0.013617705
## 4 0.9450372 0.8651972 0.8404915 0.12484258 0.09720186 0.012273615
## 5 0.9340507 0.8470638 0.9737058 0.01995886 0.03742636 0.009553237
## 6 0.9328022 0.8374948 0.9724562 0.01850928 0.03413663 0.008959014
## 7 0.9131920 0.8177403 0.7377688 0.30248827 0.16810262 0.037922414
## 8 0.9037919 0.8014867 0.9605600 0.05114317 0.08035315 0.021650395
## 9 0.8969894 0.7897869 0.9625695 0.04388729 0.07369497 0.021304530
## 10 0.9348503 0.8223210 0.8321282 0.13776870 0.09241579 0.018733256
## 11 0.9321754 0.8118652 0.9680747 0.04770093 0.07694683 0.016718066
## 12 0.9298682 0.8072078 0.9679590 0.04748347 0.07593245 0.016728812
## sd.spearman sd.kendall sd.q.squared
## 1 0.02437750 0.03299084 0.05754771
## 2 0.02094661 0.04276941 0.01433604
## 3 0.03540891 0.05493194 0.01441225
## 4 0.03521232 0.06358434 0.02411691
## 5 0.04367401 0.06367079 0.01125424
## 6 0.04360961 0.06678759 0.01009343
## 7 0.05123220 0.07460691 0.07289115
## 8 0.08817375 0.12461058 0.02460951
## 9 0.10013855 0.14334661 0.02362557
## 10 0.02155212 0.04877876 0.03500259
## 11 0.03233777 0.04936805 0.01860968
## 12 0.03315534 0.05306460 0.01898663
After a grid search, you can extract the model(s) with the best predictive ability by applying the find.best.models function to the result of cv.models. In the following example, find.best.models extracts the model(s) with the highest value of Q^2.
# See the effect of hyper-parameters on the predictive ability of the model.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    grid = list(interaction.depth = c(1, 5), n.minobsinnode = c(1, 10)),
    grid.predict = list(n.trees = c(10, 50, 80))
)
print(cv)
## Result of cross validation
## Cross validation metrics:
## interaction.depth n.minobsinnode n.trees mse rmse r.squared
## 1 1 1 10 0.71846209 0.8313373 0.9375318
## 2 1 1 50 0.07800233 0.2761946 0.9770139
## 3 1 1 80 0.07182894 0.2663227 0.9770512
## 4 5 1 10 0.45879621 0.6741641 0.9773939
## 5 5 1 50 0.06968244 0.2581932 0.9762469
## 6 5 1 80 0.07297073 0.2639482 0.9741611
## 7 1 10 10 0.72679446 0.8386957 0.9223218
## 8 1 10 50 0.11476425 0.3300293 0.9652568
## 9 1 10 80 0.10587586 0.3181813 0.9666815
## 10 5 10 10 0.49000899 0.6929212 0.9693967
## 11 5 10 50 0.09287782 0.3008929 0.9733423
## 12 5 10 80 0.08891006 0.2940389 0.9748969
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9211462 0.8297653 0.7621652 0.31841651 0.17429336 0.024474695
## 2 0.9476352 0.8598201 0.9732361 0.02629821 0.04370169 0.007656814
## 3 0.9380785 0.8433653 0.9745011 0.01759604 0.03164326 0.006777259
## 4 0.9497234 0.8622065 0.8289065 0.09806133 0.06911280 0.011764558
## 5 0.9555615 0.8708582 0.9739878 0.03161776 0.05791498 0.012940980
## 6 0.9463290 0.8555012 0.9720997 0.03264445 0.06057217 0.015555032
## 7 0.8988592 0.8029528 0.7545430 0.28155642 0.16118993 0.044251930
## 8 0.9213254 0.8042133 0.9595982 0.04945501 0.08058753 0.016519557
## 9 0.9171429 0.7930354 0.9620922 0.04117472 0.07177516 0.016537674
## 10 0.9266675 0.8139262 0.8382962 0.14209819 0.10471793 0.011958271
## 11 0.9387832 0.8290341 0.9681454 0.03056807 0.05100402 0.009005588
## 12 0.9310652 0.8133410 0.9694114 0.03073416 0.05218757 0.009269696
## sd.spearman sd.kendall sd.q.squared
## 1 0.03011275 0.03784255 0.046181444
## 2 0.02484243 0.04270127 0.005470736
## 3 0.02855585 0.05496531 0.006521107
## 4 0.01579183 0.03141194 0.073645898
## 5 0.02780311 0.04833118 0.014315469
## 6 0.03109811 0.05857567 0.016755934
## 7 0.05972741 0.06872107 0.048706163
## 8 0.05259488 0.06576310 0.018044766
## 9 0.05243438 0.06157570 0.017725410
## 10 0.03192243 0.05546427 0.025920873
## 11 0.03089184 0.05802708 0.010064317
## 12 0.03197896 0.06039529 0.010330123
# Get best model.
best <- find.best.models(cv, "q.squared")
print(best)
## ------------------------------------------------
## Result of model selection by cross validation
## -> 1 best models was found:
## ------------------------------------------------
## Model index 1
## ------------------------------------------------
## Result of cross validation
## Cross validation metrics:
## interaction.depth n.minobsinnode n.trees mse rmse r.squared
## 1 1 1 80 0.07182894 0.2663227 0.9770512
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9380785 0.8433653 0.9745011 0.01759604 0.03164326 0.006777259
## sd.spearman sd.kendall sd.q.squared
## 1 0.02855585 0.05496531 0.006521107
The result of the find.best.models function is a cv.best.models object, which is a list of cv.result objects.
# See the class of the result of find.best.models().
class(best)
## [1] "cv.best.models"
# The result of find.best.models() is a list of cv.result objects.
class(best[[1]])
## [1] "cv.result"
print(best[[1]])
## Result of cross validation
## Cross validation metrics:
## interaction.depth n.minobsinnode n.trees mse rmse r.squared
## 1 1 1 80 0.07182894 0.2663227 0.9770512
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9380785 0.8433653 0.9745011 0.01759604 0.03164326 0.006777259
## sd.spearman sd.kendall sd.q.squared
## 1 0.02855585 0.05496531 0.006521107
Using the following functions, you can extract information from the result of cv.models.
cv <- cv.models(
    gbm(Petal.Length ~ ., data = iris, distribution = "gaussian", n.cores = 1),
    grid = list(interaction.depth = c(1, 5), n.minobsinnode = c(1, 10)),
    grid.predict = list(n.trees = c(10, 50, 80))
)
# Extract the prediction of the 10th model.
# The result is a data.frame with three columns:
# "response": the original values of the response variable,
# "prediction": the predictions of the model,
# "index": the index of each value of the response variable in the original dataset.
fit <- extract.fit(cv, 10)
head(fit)
## response prediction index
## 1 1.4 2.275855 1
## 2 1.4 2.238841 2
## 3 1.6 2.238841 30
## 4 1.4 2.275855 34
## 5 1.3 2.238841 43
## 6 4.9 4.379440 53
# Extract the detailed information of the 10th model.
# The result is an object of cv.result.
extract.result(cv, 10)
## Result of cross validation
## Cross validation metrics:
## interaction.depth n.minobsinnode n.trees mse rmse r.squared
## 1 5 10 10 0.49064 0.6961741 0.9674917
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9302293 0.812527 0.8384762 0.1151032 0.08152483 0.02182448
## sd.spearman sd.kendall sd.q.squared
## 1 0.03004164 0.05664544 0.0198745
# Extract table of performance metrics.
extract.metrics(cv)
## interaction.depth n.minobsinnode n.trees mse rmse r.squared
## 1 1 1 10 0.71175871 0.8255634 0.9388859
## 2 1 1 50 0.08085989 0.2763715 0.9781593
## 3 1 1 80 0.07465255 0.2672395 0.9791906
## 4 5 1 10 0.45859380 0.6704822 0.9782709
## 5 5 1 50 0.06940618 0.2616235 0.9799677
## 6 5 1 80 0.07398858 0.2704843 0.9780468
## 7 1 10 10 0.72252293 0.8363674 0.9120885
## 8 1 10 50 0.10507791 0.3149128 0.9560609
## 9 1 10 80 0.09986131 0.3077974 0.9580154
## 10 5 10 10 0.49064000 0.6961741 0.9674917
## 11 5 10 50 0.09147345 0.2890595 0.9717959
## 12 5 10 80 0.08797113 0.2826143 0.9728868
## spearman kendall q.squared sd.mse sd.rmse sd.r.squared
## 1 0.9233408 0.8352378 0.7674455 0.29728458 0.18319308 0.026928756
## 2 0.9354004 0.8446767 0.9721795 0.03830157 0.07054320 0.011071687
## 3 0.9304714 0.8324126 0.9733951 0.03074939 0.05995924 0.009795229
## 4 0.9519100 0.8678236 0.8482883 0.14205656 0.10026301 0.007476456
## 5 0.9460936 0.8450875 0.9765642 0.01653659 0.03264850 0.006240878
## 6 0.9375281 0.8294852 0.9748364 0.01547064 0.03030980 0.007332415
## 7 0.8913047 0.7919933 0.6930442 0.26375839 0.15990477 0.090690587
## 8 0.9142751 0.8096110 0.9504617 0.05289921 0.08102028 0.049334548
## 9 0.9102315 0.7945727 0.9533575 0.04702192 0.07543999 0.045385934
## 10 0.9302293 0.8125270 0.8384762 0.11510317 0.08152483 0.021824479
## 11 0.9418915 0.8312178 0.9699236 0.05906155 0.09379684 0.018712111
## 12 0.9411693 0.8331314 0.9708471 0.05899344 0.09487014 0.018760298
## sd.spearman sd.kendall sd.q.squared
## 1 0.04039322 0.06626447 0.050097523
## 2 0.03324363 0.06671469 0.012841793
## 3 0.03289490 0.06723819 0.013842410
## 4 0.02662593 0.04486590 0.026471476
## 5 0.02630269 0.05088710 0.005175986
## 6 0.02800890 0.05193071 0.005726293
## 7 0.05659831 0.07296444 0.265949943
## 8 0.06375212 0.08731345 0.056262327
## 9 0.06900359 0.09243878 0.049589749
## 10 0.03004164 0.05664544 0.019874504
## 11 0.02179858 0.04640157 0.018805598
## 12 0.02251280 0.04516270 0.018996378
The extract.fit function can also be used with cv.best.models and cv.result objects.
# Extract prediction from the first model of best models.
best <- find.best.models(cv, "q.squared")
fit <- extract.fit(best, 1)
head(fit)
## response prediction index
## 1 1.3 1.498935 3
## 2 1.5 1.512613 40
## 3 1.5 1.455967 49
## 4 4.5 4.262396 67
## 5 4.1 3.686600 68
## 6 4.8 4.586651 71
# Extract the 10th model as a cv.result object.
result <- extract.result(cv, 10)
# Extract prediction of the 10th model.
fit <- extract.fit(result)
head(fit)
## response prediction index
## 1 1.4 2.275855 1
## 2 1.4 2.238841 2
## 3 1.6 2.238841 30
## 4 1.4 2.275855 34
## 5 1.3 2.238841 43
## 6 4.9 4.379440 53
You can use the plot function to show the relationship between predicted and actual values. Each point represents a pair of predicted and actual values, and the line represents \(Y = X\). Currently, plotting does not work for classification models.
# Plot predicted and actual values obtained from the lm model.
cv <- cv.models(
    lm(Petal.Length ~ ., data = iris)
)
plot(cv)
# See the effect of hyper-parameters on the predictive ability of the model.
cv <- cv.models(
    gbm(
        Petal.Length ~ ., data = iris, weights = iris$Sepal.Width,
        distribution = "gaussian", n.cores = 1
    ),
    grid = list(interaction.depth = c(1, 5), n.minobsinnode = c(1, 10)),
    grid.predict = list(n.trees = c(10, 50, 80))
)
# Show the relationship between prediction and actual values in the 10th result.
plot(cv, 10)
to be continued…
When using gbm, please specify a value of n.trees in the call of gbm that is larger than or equal to the values used for grid.predict. When a smaller value of n.trees is used in the call of gbm than the values used for grid.predict, values of n.trees in grid.predict larger than the value specified in the call have no effect on the result.
# gbm is called with n.trees = 10,
# but n.trees of 5, 10, 50 and 80 are used in grid.predict.
cv <- cv.models(
    gbm(
        Petal.Length ~ ., data = iris, weights = iris$Sepal.Width,
        distribution = "gaussian", n.trees = 10
    ),
    grid.predict = list(n.trees = c(5, 10, 50, 80))
)
# In this situation, values of n.trees in grid.predict larger than the value
# specified in gbm (10) have no effect on the predictive ability of the model,
# and the same results are returned.
print(cv)
## Result of cross validation
## Cross validation metrics:
## n.trees mse rmse r.squared spearman kendall q.squared
## 1 5 1.3678147 1.1637576 0.8791430 0.8393686 0.7300279 0.5354089
## 2 10 0.7024126 0.8305721 0.9251183 0.9243442 0.8299453 0.7623343
## 3 50 0.7024126 0.8305721 0.9251183 0.9243442 0.8299453 0.7623343
## 4 80 0.7024126 0.8305721 0.9251183 0.9243442 0.8299453 0.7623343
## sd.mse sd.rmse sd.r.squared sd.spearman sd.kendall sd.q.squared
## 1 0.2937742 0.1223969 0.09564163 0.04474389 0.04746734 0.06538468
## 2 0.2003206 0.1181459 0.05797862 0.03408190 0.05040874 0.05462135
## 3 0.2003206 0.1181459 0.05797862 0.03408190 0.05040874 0.05462135
## 4 0.2003206 0.1181459 0.05797862 0.03408190 0.05040874 0.05462135
When the formula is defined outside of the function call and weights is specified in the following way, cv.models can't work with gbm.
# This works.
cv <- cv.models(
    gbm(
        Petal.Length ~ ., data = iris, weights = iris$Sepal.Width,
        distribution = "gaussian", n.trees = 10
    ),
    n.trees = 10, n.cores = 1
)
# For an unknown reason, this doesn't work.
f <- Petal.Length ~ .
cv <- cv.models(
    gbm(
        f, data = iris, weights = iris$Sepal.Width,
        distribution = "gaussian", n.trees = 10
    ),
    n.trees = 10, n.cores = 1
)
As a workaround, you can specify weights without the iris$ prefix.
f <- Petal.Length ~ .
cv <- cv.models(
    gbm(
        f, data = iris, weights = Sepal.Width,
        distribution = "gaussian", n.trees = 10
    ),
    n.trees = 10, n.cores = 1
)
Some functions use additional data, such as the weights argument of glm and gbm. In this situation, a function call produced by call can't be used with cv.models.
# A call object produced by the call function with weights can't be used with cv.models.
call.gbm <- call(
    "gbm", Petal.Length ~ ., data = iris, weights = iris$Sepal.Width,
    distribution = "gaussian", n.trees = 10, n.cores = 1
)
# Error!
cv <- cv.models(call.gbm, n.trees = 10)
# Same for GLM.
call.glm <- call(
    "glm", Petal.Length ~ ., data = iris, weights = iris$Sepal.Width
)
cv <- cv.models(call.glm)
As a workaround, please use the substitute function instead of the call function. Also, please specify the variable name for weights without iris$.
# Making call with substitute.
call.gbm <- substitute(
    gbm(
        Petal.Length ~ ., data = iris, weights = Sepal.Width,
        distribution = "gaussian", n.trees = 10, n.cores = 1
    )
)
# OK!
cv <- cv.models(call.gbm, n.trees = 10)
# Same for GLM.
call.glm <- substitute(
    glm(Petal.Length ~ ., data = iris, weights = Sepal.Width)
)
cv <- cv.models(call.glm)
offset terms
The predict methods of some models don't adjust offset terms. Therefore, cv.models adjusts the effect of offset terms if:
- the offset term is specified as formula = y ~ x + offset(param) or offset = param, rather than formula = y ~ x + offset(data$param) or offset = data$param;
- the model is not glmmML or ranger (see below).
When adjusting offsets for models with a link function, cv.models assumes that the offset values in data are on the response scale and are transformed in the model call, as in the following example. When pre-transformed data is specified for the offset term, the adjustment doesn't work correctly. Note that because cv.models can't know how the data was transformed, it can't produce warning or error messages.
cv <- cv.models(
    glm(Petal.Length ~ Sepal.Width + offset(log(Petal.Width)), data = iris)
)
ranger::ranger
It's not clear whether the predict method of ranger::ranger adjusts offset terms or not. Therefore, cv.models currently doesn't adjust offset terms for ranger models.
glmmML::glmmML
The glmmML function in the glmmML package doesn't provide a predict method, so cv.models implements a predict method for it. However, because the way to adjust offsets for glmmML is not clear, the current version of cv.models doesn't adjust offset terms for glmmML.
Currently, cv.models uses population-level prediction (i.e., prediction ignoring random effects) for performance evaluation.
Too many kinds of complicated objects
The functions in this package create cv.models, cv.best.models and cv.result objects. This may be refactored in a future version.
Almost no tests!
Test code for the package is still in development. Some functions may not work correctly.
AUC always becomes larger than 0.5
When the model is worse than random prediction, i.e., when it produces reversed predictions, pROC::coords switches the positive and negative classes when calculating AUC.
I'm not sure about the way AUC is calculated in the OptimalCutpoints package.
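The following small sketch illustrates the pROC behavior (assuming its default direction = "auto", which chooses the direction that gives AUC >= 0.5):
library(pROC)
y <- c(0, 0, 0, 1, 1, 1)
reversed <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)  # completely reversed predictions
auc(roc(y, reversed))  # reports AUC = 1, not 0, because the direction is flipped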
Definition of MSE
Some literature defines MSE as \(MSE = \sum((response - prediction) ^ 2) / (n - p)\), where \(n\) is the sample size and \(p\) is the number of parameters. This is different from the definition used in this package.
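For reference, the two definitions can be contrasted with a small sketch (using in-sample glm predictions as placeholder data):
model <- glm(Petal.Length ~ ., data = iris)
response <- iris$Petal.Length
prediction <- predict(model)
n <- nrow(iris)                            # sample size
p <- length(coef(model))                   # number of parameters
mean((response - prediction)^2)            # definition used in this package
sum((response - prediction)^2) / (n - p)   # definition in some literature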
Best model selection with Diagnostic Likelihood Ratio
Currently, cv.models considers models with a higher Diagnostic Likelihood Ratio to be better models, but I'm not confident this is correct. If you have a reference with the answer, please let me know.
Changes in ver. 0.1.0
- Added the group argument, which supplies a user-defined grouping for cv.models.
- Now depends on the model.adapter package ver. 0.1.0.
- Fixed: cv.models did not run cross validation when called with the call() function or a model object.
- find.best.models now uses the random number seed specified in cv.models to fix its result.
- The pROC package was replaced with the OptimalCutpoints package, and the optimal threshold can now be calculated by criteria other than Youden's J. With this modification, cv.models no longer calculates metrics such as 1-npv, and now calculates the Diagnostic Likelihood Ratio.