Title: | Building Regression and Classification Models |
---|---|
Description: | Consistent user interface to the most common regression and classification algorithms, such as random forest, neural networks, C5 trees and support vector machines, complemented with a handful of auxiliary functions, such as variable importance and a tuning function for the parameters. |
Authors: | Andri Signorell [aut, cre], Bernhard Compton [ctb], Marcel Dettling [ctb], Alexandre Hainard [ctb], Max Kuhn [ctb], Frédérique Lisacek [ctb], Michal Majka [ctb], Markus Müller [ctb], Dan Putler [ctb], Jean-Charles Sanchez [ctb], Natalia Tiberti [ctb], Natacha Turck [ctb], Jarek Tuszynski [ctb], Robin Xavier [ctb], Achim Zeileis [ctb] |
Maintainer: | Andri Signorell <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.9.15 |
Built: | 2025-02-13 06:31:48 UTC |
Source: | https://github.com/andrisignorell/modtools |
There is a rich selection of R packages out there implementing algorithms for classification and regression tasks. Their authors legitimately take the liberty of tailoring the function interfaces to their own taste and needs. For the rest of us, however, this often means struggling with user interfaces, some of which are rather peculiar - to put it mildly - and almost always different in terms of arguments and result structures.
ModTools pursues the goal of offering uniform handling for the most important regression and classification models in applied data analyses.
The function FitMod() is designed as a simple and consistent interface to these original functions, while maintaining the flexibility to pass on all possible arguments. print, plot, summary and predict operations can thus be carried out following the same logic. The results will again be reshaped to a reasonable standard.
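A minimal sketch of this uniform workflow (assuming the d.pima dataset shipped with the package, as used in the examples below):

# the same verbs, regardless of the underlying algorithm
r.lm <- FitMod(Fertility ~ ., data = swiss, fitfn = "lm")
r.rp <- FitMod(diabetes ~ ., data = d.pima, fitfn = "rpart")

summary(r.lm)                        # familiar lm-style summary
summary(r.rp)                        # same call for the tree

head(predict(r.lm))                  # numeric predictions
head(predict(r.rp, type = "class"))  # class predictions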
For all the functions of this package, the Google style guide is used for naming (in the absence of convincing alternatives). The 'BigCamelCase' style has been applied consistently to functions borrowed from contributed R packages as well.
As always: feedback, feature requests, bug reports and other suggestions are welcome!
The ModTools::FitMod() function comprises interfaces to the following models:
Regression: | |
lm() | Linear model OLS (base) |
lmrob() | Robust linear model (robustbase) |
poisson() | GLM model with family poisson (base) |
negbin() | GLM model with family negative.binomial (MASS) |
gamma() | GLM model with family Gamma (base) |
tobit() | Tobit model for censored responses (package AER) |
Classification: | |
lda() | Linear discriminant analysis (MASS) |
qda() | Quadratic discriminant analysis (MASS) |
logit() | Logistic regression model glm, family binomial(logit) (base) |
multinom() | Multinomial regression model (nnet) |
polr() | Proportional odds model (MASS) |
rpart() | Regression and classification trees (rpart) |
nnet() | Neural networks (nnet) |
randomForest() | Random forests (randomForest) |
C5.0() | C5.0 tree (C50) |
svm() | Support vector machines (e1071) |
naive_bayes() | Naive Bayes classifier (naivebayes) |
LogitBoost() | Logit boost, using decision stumps as weak learners (ModTools) |
Preprocess: | |
SplitTrainTest() | Splits a data frame or index vector into a training and a test sample |
OverSample() | Get balanced datasets by sampling with replacement |
Manipulating rpart objects: | |
CP() | Extract and plot the complexity table of an rpart tree |
Node() | Accessor to the most important properties of a node, being a split or a leaf |
Rules() | Extract the decision rules from the top to the end node of an rpart tree |
LeafRates() | Returns the misclassification rates in all end nodes |
Prediction and Validation: | |
Response() | Extract the response variable of any model |
predict() | Consistent predict for FitMod models |
VarImp() | Variable importance for most FitMod models |
ROC() | ROC curves for all dichotomous classification FitMod models |
BestCut() | Find the optimal cut for a classification based on the ROC curve |
PlotLift() | Produces a lift chart for a binary classification model |
TModC() | Aggregated results for multiple FitMod classification models |
Tune() | Tuning approaches to find optimal parameters for FitMod classification models |
RobSummary() | Robust summary for GLM models (poisson) |
Tests: | |
BreuschPaganTest() | Breusch-Pagan test against heteroskedasticity |
This package is still under development. Be aware that everything in it may be subject to change; backward compatibility is not yet guaranteed. Functions may be deleted or renamed, and new syntax may be inconsistent with earlier versions. With the release of version 1.0, the usual 'deprecated-defunct' process will be put in place.
Andri Signorell
Helsana Versicherungen AG, Health Sciences, Zurich
HWZ University of Applied Sciences in Business Administration Zurich.
Includes R source code and/or documentation previously published by (in alphabetical order):
Bernhard Compton, Marcel Dettling, Max Kuhn, Michal Majka, Dan Putler, Jarek Tuszynski, Robin Xavier, Achim Zeileis
The good things come from all these guys, any problems are likely due to my tweaking.
Thank you all!
Maintainer: Andri Signorell <[email protected]>
r.swiss <- FitMod(Fertility ~ ., swiss, fitfn="lm")
r.swiss
# PlotTA(r.swiss)
# PlotQQNorm(r.swiss)

## Count models
data(housing, package="MASS")

# poisson count
r.pois <- FitMod(Freq ~ Infl*Type*Cont + Sat, family=poisson,
                 data=housing, fitfn="poisson")

# negative binomial count
r.nb <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="negbin")
summary(r.nb)

r.log <- FitMod(log(Freq) ~ Infl*Type*Cont + Sat, data=housing, fitfn="lm")
summary(r.log)

r.ols <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="lm")
summary(r.ols)

r.gam <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing, fitfn="gamma")
summary(r.gam)

r.gami <- FitMod(Freq ~ Infl*Type*Cont + Sat, data=housing,
                 fitfn="gamma", link="identity")
summary(r.gami)

old <- options(digits=3)
TMod(r.pois, r.nb, r.log, r.ols, r.gam, r.gami)
options(old)

## Ordered Regression
r.polr <- FitMod(Sat ~ Infl + Type + Cont, data=housing,
                 fitfn="polr", weights = Freq)

# multinomial Regression
# r.mult <- FitMod(factor(Sat, ordered=FALSE) ~ Infl + Type + Cont,
#                  data=housing, weights = housing$Freq, fitfn="multinom")

# Regression tree
r.rp <- FitMod(factor(Sat, ordered=FALSE) ~ Infl + Type + Cont,
               data=housing, weights = housing$Freq, fitfn="rpart")

# compare predictions
d.p <- expand.grid(Infl=levels(housing$Infl), Type=levels(housing$Type),
                   Cont=levels(housing$Cont))
d.p$polr <- predict(r.polr, newdata=d.p)   # ??
# d.p$ols <- factor(round(predict(r.ols, newdata=d.p)^2),
#                   labels=levels(housing$Sat))
# d.p$mult <- predict(r.mult, newdata=d.p)
d.p$rp <- predict(r.rp, newdata=d.p, type="class")
d.p

# Classification with 2 classes ***************
r.pima <- FitMod(diabetes ~ ., d.pima, fitfn="logit")
r.pima
Conf(r.pima)
plot(ROC(r.pima))
OddsRatio(r.pima)

# rpart tree
rp.pima <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
rp.pima
Conf(rp.pima)
lines(ROC(rp.pima), col=hblue)

# to be improved
plot(rp.pima, col=SetAlpha(c("blue","red"), 0.4), cex=0.7)

# Random Forest
rf.pima <- FitMod(diabetes ~ ., d.pima, method="class", fitfn="randomForest")
rf.pima
Conf(rf.pima)
lines(ROC(r.pima), col=hred)

# more models to compare
d.pim <- SplitTrainTest(d.pima, p = 0.2)
mdiab <- formula(diabetes ~ pregnant + glucose + pressure + triceps
                 + insulin + mass + pedigree + age)

r.glm <- FitMod(mdiab, data=d.pim$train, fitfn="logit")
r.rp  <- FitMod(mdiab, data=d.pim$train, fitfn="rpart")
r.rf  <- FitMod(mdiab, data=d.pim$train, fitfn="randomForest")
r.svm <- FitMod(mdiab, data=d.pim$train, fitfn="svm")
r.c5  <- FitMod(mdiab, data=d.pim$train, fitfn="C5.0")
r.nn  <- FitMod(mdiab, data=d.pim$train, fitfn="nnet")
r.nb  <- FitMod(mdiab, data=d.pim$train, fitfn="naive_bayes")
r.lda <- FitMod(mdiab, data=d.pim$train, fitfn="lda")
r.qda <- FitMod(mdiab, data=d.pim$train, fitfn="qda")
r.lb  <- FitMod(mdiab, data=d.pim$train, fitfn="lb")

mods <- list(glm=r.glm, rp=r.rp, rf=r.rf, svm=r.svm, c5=r.c5,
             nn=r.nn, nb=r.nb, lda=r.lda, qda=r.qda, lb=r.lb)

# insight into the regression tree
plot(r.rp, box.palette = as.list(Pal("Helsana", alpha = 0.5)))

# Insample accuracy ...
TModC(mods, ord="auc")
# ... is substantially different from the out-of-bag:
TModC(mods, newdata=d.pim$test, reference=d.pim$test$diabetes, ord="bs")
# C5 and SVM turn out to be show-offs! They overfit quite ordinarily,
# whereas randomForest and logit keep their promises.
sapply(mods, function(z) VarImp(z))

# Multinomial classification problem with n classes ***************
d.gl <- SplitTrainTest(d.glass, p = 0.2)
mglass <- formula(Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe)

# *** raises an unclear error in CRAN-Debian tests *** ??
# r.mult <- FitMod(mglass, data=d.gl$train, maxit=600, fitfn="multinom")
r.rp   <- FitMod(mglass, data=d.gl$train, fitfn="rpart")
r.rf   <- FitMod(mglass, data=d.gl$train, fitfn="randomForest")
r.svm  <- FitMod(mglass, data=d.gl$train, fitfn="svm")
r.c5   <- FitMod(mglass, data=d.gl$train, fitfn="C5.0")
r.nn   <- FitMod(mglass, data=d.gl$train, fitfn="nnet")
r.nbay <- FitMod(mglass, data=d.gl$train, fitfn="naive_bayes")
r.lda  <- FitMod(mglass, data=d.gl$train, fitfn="lda")
# r.qda <- FitMod(mglass, data=d.glass, fitfn="qda")
r.lb   <- FitMod(mglass, data=d.gl$train, fitfn="lb")

mods <- list(rp=r.rp, rf=r.rf, svm=r.svm, c5=r.c5,
             nn=r.nn, nbay=r.nbay, lda=r.lda, lb=r.lb)

# confusion matrix and other quality measures can be calculated with Conf()
Conf(r.rf)

# we only extract the general accuracy
sapply(lapply(mods, function(z) Conf(z)), "[[", "acc")

# let's compare r.mult with a model without RI as predictor
# Conf(r.mult)
# Conf(update(r.mult, . ~ . -RI))
Returns the best cutpoint for a given classification model.
BestCut(x, method = c("youden", "closest.topleft"))
x | a roc object from the roc function |
method | one of "youden" or "closest.topleft"; see Details |
The method argument controls how the optimal threshold is determined.

youden
Youden's J statistic (Youden, 1950) is employed. The optimal cut-off is the threshold that maximizes the distance to the identity (diagonal) line. Can be shortened to "y".
The optimality criterion is: max(sensitivities + specificities).

closest.topleft
The optimal threshold is the point closest to the top-left part of the plot (perfect sensitivity and specificity). Can be shortened to "c" or "t".
The optimality criterion is: min((1 - sensitivities)^2 + (1 - specificities)^2).
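As a plausibility check, Youden's J can be computed by hand from the components of the roc object (a sketch; the field names are those of pROC's "roc" objects):

r.roc <- ROC(FitMod(diabetes ~ ., data = d.pima, fitfn = "logit"))
j <- r.roc$sensitivities + r.roc$specificities - 1
r.roc$thresholds[which.max(j)]   # should agree with BestCut(r.roc, method="youden")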
the threshold value
Robin Xavier <[email protected]>, Andri Signorell <[email protected]> (interface)
Xavier Robin, Natacha Turck, Alexandre Hainard, et al. (2011) "pROC: an open-source package for R and S+ to analyze and compare ROC curves". BMC Bioinformatics, 12, 77. doi:10.1186/1471-2105-12-77.
r.glm <- FitMod(diabetes ~ ., data = d.pima, fitfn="logit")
ROC(r.glm)
BestCut(ROC(r.glm))
Performs the Breusch-Pagan test against heteroskedasticity.
BreuschPaganTest(formula, varformula = NULL, studentize = TRUE, data = list())
formula | a symbolic description for the model to be tested (or a fitted "lm" object) |
varformula | a formula describing only the potential explanatory variables for the variance (no dependent variable needed). By default the same explanatory variables are taken as in the main regression model. |
studentize | logical. If set to TRUE, Koenker's studentized version of the test statistic will be used. |
data | an optional data frame containing the variables in the model. By default the variables are taken from the environment from which BreuschPaganTest is called. |
The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model (by default the same explanatory variables are taken as in the main regression model) and rejects if too much of the variance is explained by the additional explanatory variables.
Under the null hypothesis, the test statistic of the Breusch-Pagan test follows a chi-squared distribution with k degrees of freedom, k being the number of regressors (without the constant) in the model.
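For intuition, the studentized statistic can be reproduced by hand as n times the R-squared of the auxiliary regression of the squared residuals on the regressors (a sketch, not the internal implementation):

x <- rep(c(-1, 1), 50)
y <- 1 + x + rnorm(100, sd = rep(c(1, 2), 50))
r.lm <- lm(y ~ x)
r.aux <- lm(residuals(r.lm)^2 ~ x)          # auxiliary regression
bp <- nobs(r.lm) * summary(r.aux)$r.squared # n * R^2
c(statistic = bp, p.value = pchisq(bp, df = 1, lower.tail = FALSE))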
Examples can not only be found on this page, but also on the help pages of the data sets bondyield, currencysubstitution, growthofmoney, moneydemand, unemployment and wages.
A list with class "htest" containing the following components:
statistic | the value of the test statistic. |
p.value | the p-value of the test. |
parameter | degrees of freedom. |
method | a character string indicating what type of test was performed. |
data.name | a character string giving the name(s) of the data. |
Achim Zeileis <[email protected]>
T.S. Breusch & A.R. Pagan (1979), A Simple Test for Heteroscedasticity and Random Coefficient Variation. Econometrica 47, 1287–1294
R. Koenker (1981), A Note on Studentizing a Test for Heteroscedasticity. Journal of Econometrics 17, 107–112.
W. Kraemer & H. Sonnberger (1986), The Linear Regression Model under Test. Heidelberg: Physica
## generate a regressor
x <- rep(c(-1,1), 50)

## generate heteroskedastic and homoskedastic disturbances
err1 <- rnorm(100, sd=rep(c(1,2), 50))
err2 <- rnorm(100)

## generate a linear relationship
y1 <- 1 + x + err1
y2 <- 1 + x + err2

## perform Breusch-Pagan test
BreuschPaganTest(y1 ~ x)
BreuschPaganTest(y2 ~ x)
Calculate the confidence interval for the difference of two coefficients in a linear model.
CoeffDiffCI(x, coeff, conf.level = 0.95, sides = c("two.sided", "left", "right"))
x | the linear model object |
coeff | a vector of length two, containing either the names or the indices of the two coefficients whose difference should be used |
conf.level | confidence level of the interval. |
sides | a character string specifying the side of the confidence interval, must be one of "two.sided" (default), "left" or "right" |
This is quite useful in the course of the modelling process, e.g. when assessing whether two coefficients differ significantly from each other.
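The interval can also be derived by hand from the covariance matrix of the coefficients (a sketch; whether CoeffDiffCI() uses the t or the normal quantile is not documented here, the t quantile is assumed):

r.lm <- lm(Fertility ~ ., data = swiss)
b <- coef(r.lm)[c("Examination", "Education")]
V <- vcov(r.lm)[c("Examination", "Education"), c("Examination", "Education")]
d  <- b[1] - b[2]
se <- sqrt(V[1, 1] + V[2, 2] - 2 * V[1, 2])           # variance of a difference
d + c(0, -1, 1) * qt(0.975, df.residual(r.lm)) * se   # mean, lwr.ci, upr.ci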
a numeric vector with 3 elements:
mean | the difference of the two coefficients |
lwr.ci | lower bound of the confidence interval |
upr.ci | upper bound of the confidence interval |
Andri Signorell <[email protected]>
# get some model first...
r.lm <- FitMod(Fertility ~ ., data=swiss, fitfn="lm")

# calculate the confidence interval for the difference of the
# coefficients Examination and Education
CoeffDiffCI(r.lm, c("Examination", "Education"))

# the test could be calculated as
car::linearHypothesis(r.lm, "Education = Examination")
Extracts, prints and plots the complexity table of an rpart model.
CP(x, ...)

## S3 method for class 'CP'
print(x, digits = getOption("digits") - 2L, ...)

## S3 method for class 'CP'
plot(x, minline = TRUE, lty = 3, col = 1,
     upper = c("size", "splits", "none"), ...)
x | fitted model object of class rpart |
digits | the number of digits of numbers to print. |
minline | whether a horizontal line is drawn 1SE above the minimum of the curve. |
lty | line type for this line |
col | colour for this line |
upper | what is plotted on the top axis: the size of the tree (the number of leaves) ("size"), the number of splits ("splits") or nothing ("none"). |
... | further arguments passed to the print and plot functions. |
The complexity parameter table is hidden deep in the entrails of the rpart result object, so it is convenient to have a function to extract it.
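For reference, the table itself sits in the cptable component of the rpart object, which is presumably what CP() extracts and wraps:

r.rp <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
r.rp$cptable   # the raw complexity table inside the rpart object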
A list containing the following components:
cp | the complexity table |
x | the rpart object |
Andri Signorell <[email protected]>
r.rp <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
CP(r.rp)
plot(CP(r.rp))
The d.glass data frame has 214 rows and 10 columns. It was collected by B. German on fragments of glass collected in forensic work.
d.glass
This data frame contains the following columns:
RI
refractive index; more precisely the refractive index is 1.518xxxx.
The next 8 measurements are percentages by weight of oxides.
Na
sodium.
Mg
magnesium.
Al
aluminium.
Si
silicon.
K
potassium.
Ca
calcium.
Ba
barium.
Fe
iron.
Type
The fragments were originally classed into seven types, one of which was absent in this dataset. The categories which occur are window float glass (WinF: 70), window non-float glass (WinNF: 76), vehicle window glass (Veh: 17), containers (Con: 13), tableware (Tabl: 9) and vehicle headlamps (Head: 29).
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
data(d.pima)
data(d.pima2)
The dataset contains the following variables
pregnant
Number of times pregnant
glucose
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
pressure
Diastolic blood pressure (mm Hg)
triceps
Triceps skin fold thickness (mm)
insulin
2-Hour serum insulin (mu U/ml)
mass
Body mass index (weight in kg/(height in metres squared))
pedigree
Diabetes pedigree function
age
Age (years)
diabetes
test whether the patient shows signs of diabetes (coded neg if negative, pos if positive)
d.pima2 is the same dataset as d.pima, with the only change that invalid 0-values are replaced by NAs.
This dataset has been borrowed from Julian Faraway's package:
faraway: Functions and datasets for books by Julian Faraway, 2015
The data may also be obtained from the package MASS.
Popular implementations of algorithms are characterized by partly unconventional implementations of the operating standards in R. For example, the function e1071::svm() returns the predicted values as attributes! FitMod() is designed as a wrapping function to offer a consistent interface for a selection of the most frequently used classification and regression models.
FitMod(formula, data, ..., subset, na.action = na.pass, fitfn = NULL)

## S3 method for class 'FitMod'
predict(object, ...)

## S3 method for class 'FitMod'
plot(x, ...)

## S3 method for class 'FitMod'
summary(object, ...)

## S3 method for class 'FitMod'
drop1(object, ...)
x | a fitted object of class FitMod |
formula | a formula expression as for classification and regression models, of the form response ~ predictors |
data | an optional data frame in which to interpret the variables occurring in formula. |
subset | expression saying which subset of the rows of the data should be used in the fit. All observations are included by default. |
na.action | a function to filter missing data. |
fitfn | code for the fitting function to be used for regression or classification. The implemented options are listed in the package overview. |
object | the model object. |
... | further arguments passed to the underlying functions. |
The function will in general return the original object, extended by the additional class FitMod, which makes it possible to capture the output and plot routines. The classification algorithms will at the minimum offer the predict options type = c("class", "prob") in addition to those implemented by the underlying function.
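A short sketch of what this looks like in practice (the exact class vector is an assumption):

r.rf <- FitMod(diabetes ~ ., data = d.pima, fitfn = "randomForest")
class(r.rf)                         # "FitMod" added on top of "randomForest"
head(predict(r.rf, type = "prob"))  # probabilities, consistent across fitfns
head(predict(r.rf, type = "class"))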
The model object as returned by the calculating function, extended with the FitMod class.
Andri Signorell <[email protected]>
r.lm <- FitMod(Fertility ~ ., data=swiss, fitfn="lm")
r.logit <- FitMod(diabetes ~ glucose + pressure + mass + age,
                  data=d.pima, fitfn="logit")
r.svm <- FitMod(diabetes ~ glucose + pressure + mass + age,
                data=d.pima, fitfn="svm")
Return the frequencies of correct and wrong classifications in given node(s) in tabular form. The 'purity', denoting the relative frequency of correctly classified elements, is useful information for the interpretation of regression and classification trees and a measure of their quality.
LeafRates(x)

## S3 method for class 'LeafRates'
plot(x, col = NULL, which = c("rel", "abs"), layout = NULL, ylim = NULL, ...)
x | fitted model object of class rpart |
col | color for the bars in the plot |
which | one of "rel" (default) or "abs", defining whether relative or absolute frequencies should be plotted |
layout | vector defining the layout |
ylim | the y limits of the plot. |
... | further arguments (not used). |
The result comprises absolute and relative frequencies per leaf.
A list with 5 elements consisting of:
node | the node id (of the leaf) |
freq | the absolute frequency of correct and wrong classifications |
p.row | the relative frequency of correct and wrong classifications |
mfreq | the total number of cases |
mperc | the percentage of the sample in the leaf |
Andri Signorell <[email protected]>
r.rp <- FitMod(Species ~ ., data=iris, fitfn="rpart")
LeafRates(r.rp)
plot(LeafRates(r.rp))
Trains the LogitBoost classification algorithm using decision stumps (one-node decision trees) as weak learners.
LogitBoost(x, ...)

## S3 method for class 'formula'
LogitBoost(formula, data, ..., subset, na.action)

## Default S3 method:
LogitBoost(x, y, nIter=ncol(x), ...)
formula | a formula expression as for regression models, of the form response ~ predictors |
data | an optional data frame in which to interpret the variables occurring in formula. |
... | additional arguments passed to the underlying functions |
subset | expression saying which subset of the rows of the data should be used in the fit. All observations are included by default. |
na.action | a function to filter missing data. |
x | a matrix or data frame with training data. Rows contain samples and columns contain features. |
y | class labels for the training data samples; a response vector with one label for each row/component of x |
nIter | an integer describing the number of iterations for which boosting should be run, i.e. the number of decision stumps that will be used. |
The function was adapted from the logitboost.R function written by Marcel Dettling; see the references and the "See Also" section. The code was modified to make it much faster for very large data sets. The speed-up was achieved by implementing an internal version of a decision stump classifier instead of using calls to rpart. That way, some of the most time-consuming operations were precomputed once instead of being performed at each iteration. Another difference is that the training and testing phases of the classification process were split into separate functions.
An object of class "LogitBoost" including components:
Stump |
List of decision stumps (one node decision trees) used:
If there are more than two classes, than several "Stumps" will be
|
lablist |
names of each class |
Jarek Tuszynski (SAIC) [email protected]
Dettling and Buhlmann (2002), Boosting for Tumor Classification of Gene Expression Data.
# basic interface
r.lb <- LogitBoost(Species ~ ., data=iris, nIter=20)
pred <- predict(r.lb)
prob <- predict(r.lb, type="prob")
d.res <- data.frame(pred, prob)
d.res[1:10, ]

# accuracy increases with nIter (at least for the train set)
table(predict(r.lb, iris, type="class", nIter= 2), iris$Species)
table(predict(r.lb, iris, type="class", nIter=10), iris$Species)
table(predict(r.lb, iris, type="class"), iris$Species)

# example of splitting the data into train and test set
d.set <- SplitTrainTest(iris)
r.lb <- LogitBoost(Species ~ ., data=d.set$train, nIter=10)
table(predict(r.lb, d.set$test, type="class", nIter=2), d.set$test$Species)
table(predict(r.lb, d.set$test, type="class"), d.set$test$Species)
The rpart result object has a complex and compact design, which can make practical use tedious for occasional users, as it is difficult to figure out how to access some specific information. The function Node() is designed as an accessor to the most important properties of a node, being a 'split' or a 'leaf' (a.k.a. 'end node'). It also serves as the basis for further convenience functions such as LeafRates().
Node(x, node = NULL, type = c("all", "split", "leaf"), digits = 3)
x | fitted model object of class rpart |
node | integer vector, defining the nodes whose details are required. |
type | one of "all" (default), "split" or "leaf", defining which node types should be returned |
digits | the number of digits for numeric values |
Node() returns detailed information for a single node in the tree. It reports all the data contained in the summary of a node, but with the option to provide a node list. The structure of the result is organised as a list.
A list containing:
id | int, id of the node |
vname | character, the name of the splitting variable ("<leaf>" for a leaf) |
isleaf | logical, TRUE for a leaf, FALSE for a split |
nobs | integer, number of observations in the node |
group | character, the predicted class for the node |
ycount | numeric, the number of observations per class in the node |
yprob | numeric, the relative frequencies for each class |
nodeprob | the global probability for an observation to fall in the node |
complexity | numeric, the complexity parameter for the node |
tprint | character, the text to be printed |
Andri Signorell <[email protected]>
r.rpart <- FitMod(Species ~ ., data=iris, fitfn="rpart")

# return node nr. 3
Node(r.rpart, node=3)

r.rp <- FitMod(Type ~ ., data = d.glass, fitfn="rpart")

# return all the splits
Node(r.rpart, type="split")
For classification purposes we might want balanced datasets. If the response variable does not have a prevalence of 50%, we can sample records so as to obtain as many cases of response A as of response B. Oversampling draws additional minority-class cases with replacement until both classes are equally frequent; undersampling instead samples the (lower) number of A cases from the records of case B.
OverSample(x, vname) UnderSample(x, vname)
x | a data frame containing predictors and response |
vname | the name of the response variable to be used for over-/undersampling |
a data frame with balanced response variable
Andri Signorell <[email protected]>
OverSample(d.pima2, "diabetes")
UnderSample(d.pima2, "diabetes")
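The effect is easy to verify by tabulating the response before and after balancing (a sketch):

table(d.pima2$diabetes)                           # imbalanced original
table(OverSample(d.pima2, "diabetes")$diabetes)   # minority class resampled up
table(UnderSample(d.pima2, "diabetes")$diabetes)  # majority class sampled down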
Provides either a total cumulative response or an incremental response rate lift chart for the purpose of comparing the predictive capability of different binary predictive models.
PlotLift(modelList, data, targLevel, trueResp, type = "cumulative", sub = "")
modelList | a character vector containing the names of the different models to be compared. The selected models must have the same y variable, which must be a binary factor, and must have been estimated using the same data set. |
data | the data frame that constitutes the comparison sample. If this data frame is not the same as the one used to estimate the models, it must contain all the variables used in the models to be compared. |
targLevel | the label for the level of the binary factor of interest. For example, in a database marketing application, this level could be "Yes" for a variable that takes on the values "Yes" and "No" to indicate if a customer responded favorably to a promotion offer. |
trueResp | the true rate of the target level for the master database the estimation and comparison data frames were originally drawn from. |
type | a character string that must either have the value "cumulative" (to produce a total cumulative response chart) or "incremental" (to produce an incremental response rate chart). |
sub | a sub-title for the plot, typically to identify the sample used. |
Lift charts are a commonly used tool in business data mining applications. They are used to assess how well a model is able to predict a desirable (from an organization's point of view) response on the part of a customer, compared to alternative estimated models and a benchmark model of approaching customers randomly. The total cumulative response chart shows the percentage of the total response the organization would receive from contacting only a given percentage (grouped by deciles) of its entire customer base. This chart is best for selecting between alternative models, and for predicting the revenues the organization will receive by contacting a given percentage of the customers that the model predicts are most likely to respond favorably. The incremental response rate chart provides the response rate among each of ten decile groups of the organization's customers, with the decile groups ordered by their estimated likelihood of a favorable response.
The function returns the sample response invisibly.
original Dan Putler, tweaks Andri Signorell <[email protected]>
d.pim <- SplitTrainTest(d.pima, p = 0.2)

r.rp <- FitMod(diabetes ~ pregnant + glucose + pressure + triceps + insulin
               + mass + pedigree + age, data=d.pim$train, fitfn="rpart")
r.glm <- FitMod(diabetes ~ pregnant + glucose + pressure + triceps + insulin
                + mass + pedigree + age, data=d.pim$train, fitfn="logit")
r.nn <- FitMod(diabetes ~ pregnant + glucose + pressure + triceps + insulin
               + mass + pedigree + age, data=d.pim$train, fitfn="nnet")

oldpar <- par(mfrow=c(1,2))
on.exit(par(oldpar))

PlotLift(c("r.rp", "r.glm", "r.nn"), data = d.pim$train,
         targLevel = "pos", trueResp = 0.34, type = "cumulative")
PlotLift(c("r.rp", "r.glm", "r.nn"), data = d.pim$train,
         targLevel = "pos", trueResp = 0.34, type = "incremental")
Provides confidence intervals for predictions of a GLM.
PredictCI(mod, newdata, conf.level = 0.95)
mod | the binomial model |
newdata | the data to be predicted |
conf.level | confidence level of the interval. Default is 0.95. |
The confidence intervals for the predictions are calculated using the standard errors of the model predictions and the normal quantile.
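A hand-rolled version of this computation for a logit model might look as follows (a sketch on the link scale, back-transformed at the end; the column names are an assumption):

r.glm <- glm(diabetes ~ age, data = d.pima, family = binomial)
pr <- predict(r.glm, newdata = d.pima, type = "link", se.fit = TRUE)
z <- qnorm(0.975)
head(cbind(fit    = plogis(pr$fit),
           lwr.ci = plogis(pr$fit - z * pr$se.fit),
           upr.ci = plogis(pr$fit + z * pr$se.fit)))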
a matrix with 3 columns for the fit, the lower confidence interval and the upper confidence interval
Andri Signorell <[email protected]>
r.logit <- FitMod(diabetes ~ age, d.pima, fitfn = "logit")
head(PredictCI(r.logit, newdata=d.pima))
Returns all the reference levels of the factors used in a linear model. It is user-friendly to also report the reference levels in lm summaries, where they are normally suppressed.
RefLevel(x)
x | lm object, linear model with factors as predictors. |
For reporting tables of linear models we might want to include information about the reference levels used, which remain unmentioned in the default lm result output. RefLevel() makes it possible to add a footnote or to integrate the reference levels in the coefficient table.
a named vector containing the reference levels of all factors
It is not clear how general the algorithm used is for more exotic models. In such cases dummy.coef could be an alternative.
Andri Signorell <[email protected]>
dummy.coef, Response, relevel, lm
RefLevel(lm(breaks ~ wool + tension, data = warpbreaks))
Time after time, in the course of our daily work, we experience that the response variable is hidden very deeply in the model object, which leads to superfluous consultation of the documentation. Response() relieves us of this work.
Response(x, ...)
x | the model to use |
... | further arguments |
The function implements the extraction of the response variables for all the models listed in the package's help text.
the response of model x
Andri Signorell <[email protected]>
model.frame, model.response, RefLevel
r.rpart <- FitMod(diabetes ~ ., d.pima, fitfn="rpart")
Response(r.rpart)

# up to the attribute "response" this is the same as
identical(StripAttr(Response(r.rpart), "response"),
          model.response(model.frame(r.rpart)))
For Poisson models with mild violation of the distributional assumption that the variance equals the mean, Cameron and Trivedi (2009) recommend using robust standard errors for the parameter estimates. The function uses vcovHC from the package sandwich to obtain the robust standard errors and calculates the p-values accordingly. It returns a matrix containing the usual results in the model summary, comprising the parameter estimates, their robust standard errors and p-values, extended with the 95% confidence interval.
RobSummary(mod, conf.level = 0.95, type = "HC0")
mod | the model for which robust standard errors should be calculated |
conf.level | the confidence level, default is 95%. |
type | a character string specifying the estimation type. Details in vcovHC. |
Further details can be found at https://stats.oarc.ucla.edu/r/dae/poisson-regression/
a p x 6 matrix with columns for the estimated coefficient, its standard error, t- or z-statistic, the corresponding (two-sided) p-value, the lower and upper confidence interval.
Andri Signorell <[email protected]>
Cameron, A. C. and Trivedi, P. K. (2009) Microeconometrics Using Stata. College Station, TX: Stata Press.
r.lm <- lm(Fertility ~ ., swiss)
RobSummary(r.lm)
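Roughly the same numbers can be obtained directly with the sandwich package (a sketch of the presumed internals):

library(sandwich)
r.pois <- glm(Freq ~ Infl + Type + Cont, data = MASS::housing, family = poisson)
est <- coef(r.pois)
se  <- sqrt(diag(vcovHC(r.pois, type = "HC0")))   # robust standard errors
cbind(est, se,
      p.value = 2 * pnorm(abs(est / se), lower.tail = FALSE),
      lwr.ci  = est - qnorm(0.975) * se,
      upr.ci  = est + qnorm(0.975) * se)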
This is a wrapper to the main function roc of the pROC package (by Xavier Robin et al.). It builds a ROC curve and returns a "roc" object, a list of class "roc".
ROC(x, resp = NULL, ...)
x | a model object, or the predicted probabilities when resp is not NULL |
resp | the response |
... | all arguments are passed on to pROC::roc() |
Partial ROC is calculated following Peterson et al. (2008; doi:10.1016/j.ecolmodel.2007.11.008). This function is a modification of the PartialROC function, available at https://github.com/narayanibarve/ENMGadgets.
An object of class "roc" (see above); for partial ROC, a data.frame containing the AUC values and AUC ratios calculated for each iteration.
Andri Signorell <[email protected]>
Peterson, A.T. et al. (2008) Rethinking receiver operating characteristic analysis applications in ecological niche modeling. Ecol. Modell., 213, 63-72.
r.glm <- FitMod(diabetes ~ ., data = d.pima, fitfn="logit")
ROC(r.glm)

# plot ROC curves for a list of models
r.rp <- FitMod(diabetes ~ ., data = d.pima, fitfn="rpart")

# combine models to a list
mlst <- list(r.glm, r.rp)

# do the plot
for(i in seq_along(mlst))
  if(i==1){
    plot(ROC(mlst[[i]], grid=TRUE, col=c(hred, hblue)[i]))
  } else {
    lines(ROC(mlst[[i]], col=c(hred, hblue)[i]))
  }
Extract rules from an rpart object. This can be useful if the rules must be implemented in another system. The rules contain all the criteria for the binary splits of an rpart tree from the root node down to the specified leaf.
Rules(x, node = NULL, leafonly = FALSE)
x | the rpart object to extract the rules from |
node | integer vector, defining the nodes whose details are required. |
leafonly | logical, defining whether only the rules leading to end nodes ("leaves") should be returned. |
The function builds upon the original function path.rpart, which is bulky in some situations.
a list with the rules, containing:
frame | the frame of the rpart object |
ylevels | the y values of the node |
ds.size | the size of the dataset |
path | a list of character vectors containing the rules |
Andri Signorell <[email protected]>
r.rp <- FitMod(diabetes ~ ., data=d.pima, fitfn="rpart")
Rules(r.rp)
For modeling we usually split our data frame into a train sample, on which we train our model, and a test sample, on which we test how well it works. This function splits a given data frame into two such parts, returned in the form of a list with the two elements train and test.
SplitTrainTest(x, p = 0.1, seed = NULL, logical = FALSE)
x | data.frame |
p | proportion for the test sample. Default is 10%. |
seed | initialization for the random number generator. |
logical | logical, defining whether a logical vector should be returned instead of the list with train and test data. Default is FALSE. |
In order to obtain reasonable models, we should ensure two points: the dataset must be large enough to yield statistically meaningful results, and it should be representative of the data set as a whole. Assuming that our test set meets these two conditions, our goal is to create a model that generalizes well to new data, i.e. one that predicts training and test data equally well. We should never train on test data: if we are seeing surprisingly good results on the evaluation metrics, it might be a sign that we are accidentally training on the test set.
If logical is FALSE, a list with two data frames, train and test, of the same structure as the given data in x; if logical is TRUE, a logical vector containing nrow elements of TRUE and FALSE.
SplitTrainTest(d.pima)
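The logical variant returns an index vector instead of the two data frames; a sketch (check the orientation with table(), since which level marks the test part is not stated above):

idx <- SplitTrainTest(d.pima, p = 0.2, seed = 17, logical = TRUE)
table(idx)   # roughly an 80/20 split of the nrow(d.pima) rows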
For the comparison of several classification models, the AUC values and Brier score values of the models are determined and tabulated. Both the absolute and the relative values are reported, each relative to the model with the best corresponding value.
TModC(..., newdata = NULL, reference = NULL, ord = NULL)

## S3 method for class 'TModC'
plot(x, col = NULL, args.legend = NULL, ...)
... | the models to be compared |
x | the TModC object to be plotted |
newdata | the data to use for predicting. If not provided, the training data are used (insample). |
reference | the reference values |
ord | character defining the order of the results table; can be any of the column names of the result, e.g. "auc" or "bs" |
col | the color for the lines in the ROC plot |
args.legend | the legend to be placed in the ROC plot |
a matrix with the columns:
auc | absolute value of the area under the ROC curve (AUC) |
auc_p | percentage of the AUC based on the best observed AUC |
auc_rnk | the rank of the AUC |
bs | absolute value of the Brier score |
bs_p | percentage of the Brier score based on the best observed BS |
bs_rnk | the rank of the BS |
auc_grnk | character representation of the AUC rank |
bs_grnk | character representation of the BS rank |
Andri Signorell <[email protected]>
TMod, BrierScore, AUC, ROC
d.pim <- SplitTrainTest(d.pima, p = 0.2)
mdiab <- formula(diabetes ~ pregnant + glucose + pressure + triceps
                 + insulin + mass + pedigree + age)

r.glm <- FitMod(mdiab, data=d.pim$train, fitfn="logit")
r.rp <- FitMod(mdiab, data=d.pim$train, fitfn="rpart")
mods <- list(glm=r.glm, rp=r.rp)

# the table with the measures
(tm <- TModC(mods, ord="auc"))

# plotting the ROC curves
plot(tm, col=c("darkmagenta", "dodgerblue"))
Fitting and testing Tobit regression models for censored data.
Tobit(formula, left = 0, right = Inf, dist = "gaussian", subset = NULL, data = list(), ...)
formula | a symbolic description of a regression model of type y ~ x1 + x2 + ... |
left | left limit for the censored dependent variable |
right | right limit for the censored dependent variable |
dist | assumed distribution for the dependent variable |
subset | a specification of the rows to be used. |
data | a data frame containing the variables in the model. |
... | further arguments passed to survreg |
The function Tobit is an alias for the AER function tobit (Achim Zeileis <[email protected]>). All details can be found there.
An object of class "Tobit"
inheriting from class "survreg"
.
Andri Signorell
# still to do
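In the meantime, a minimal sketch along the lines of the AER::tobit() examples (assuming Tobit() accepts the same arguments as tobit()):

data("Affairs", package = "AER")
r.tob <- Tobit(affairs ~ age + yearsmarried + religiousness + occupation + rating,
               data = Affairs)
summary(r.tob)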
Some classifiers benefit more than others from parameters adjusted to a particular dataset. However, it is often not clear from the beginning how the parameters have to be chosen, and when several parameters have to be found in combination, often all that remains is a grid search. The present function uses a grid search approach for the decisive arguments (typically for a neural network, a random forest or a classification tree). However, it is not restricted to these models; any model fulfilling weak interface standards can be provided.
Tune(x, ..., testset = NULL, keepmod = TRUE)
x | the model to be tuned, best (but not necessarily) trained with FitMod |
... | a list of parameters, containing the values to be used for the grid search. |
testset | a test set containing all variables required in the model, used for independently calculating the accuracy (normally a subset of the original dataset). |
keepmod | logical, defining whether all fitted models should be returned in the result set. Default is TRUE. |
The function creates an n-dimensional grid from the given parameters and fits the model for every combination of parameter values. The accuracy for the models is calculated insample and, if one has been provided, on a test set. To avoid overfitting, it makes sense to provide a test set to be evaluated as well. A matrix with all combinations of the values for the given parameters is returned, sorted by accuracy, either by the accuracy achieved in the test set or by the insample accuracy.
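The grid that is effectively iterated over is the cross product of the supplied parameter values, e.g. for the neural network tuned in the examples below:

expand.grid(size = 12:17, decay = 10^(-4:-1))   # 6 x 4 = 24 candidate models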
a matrix with all supplied parameters and a column "acc" and, if a test set has been provided, a column "test_acc"
Andri Signorell <[email protected]>
d.pim <- SplitTrainTest(d.pima, p = 0.2)
mdiab <- formula(diabetes ~ pregnant + glucose + pressure + triceps
                 + insulin + mass + pedigree + age)

# tune a neural network for size and decay
r.nn <- FitMod(mdiab, data=d.pim$train, fitfn="nnet")
(tu <- Tune(r.nn, size=12:17, decay = 10^(-4:-1), testset=d.pim$test))

# tune a random forest
r.rf <- FitMod(mdiab, data=d.pim$train, fitfn="randomForest")
(tu <- Tune(r.rf, mtry=seq(2, 20, 2), testset=d.pim$test))

# tune a SVM model
r.svm <- FitMod(mdiab, data=d.pim$train, fitfn="svm")
tu <- Tune(r.svm,
           kernel=c("radial", "sigmoid"),
           cost=c(0.1, 1, 10, 100, 1000),
           gamma=c(0.5, 1, 2, 3, 4),
           testset=d.pim$test)

# let's get some more quality measures
tu$modpar$Sens <- sapply(tu$mods, Sens)   # Sensitivity
tu$modpar$Spec <- sapply(tu$mods, Spec)   # Specificity
Sort(tu$modpar, ord="test_acc", decreasing=TRUE)
Variable importance expresses the desire to know how important a variable is within a group of predictors for a particular model. But in general it is not a well-defined concept; there is no theoretically founded variable importance metric. Nevertheless, some approaches have become established in practice for certain regression and classification algorithms.
The present function provides an interface for calculating variable importance for some of the models produced by FitMod
, comprising linear models, classification trees, random forests, C5 trees and neural networks. The intention here is to provide reasonably homogeneous output and plot routines.
VarImp(x, scale = FALSE, sort = TRUE, ...)

## S3 method for class 'FitMod'
VarImp(x, scale = FALSE, sort = TRUE, type=NULL, ...)

## Default S3 method:
VarImp(x, scale = FALSE, sort = TRUE, ...)

## S3 method for class 'VarImp'
plot(x, sort = TRUE, maxrows = NULL, main = "Variable importance", ...)

## S3 method for class 'VarImp'
print(x, digits = 3, ...)
x | the fitted model |
scale | logical, should the importance values be scaled to 0 and 100? |
... | parameters to pass to the specific method |
sort | the name of the column by which the importance table should be ordered |
maxrows | the maximum number of rows to be reported |
main | the main title for the plot |
type | some models have more than one type available to produce a variable importance. Linear models accept one of the types provided by the relaimpo package. |
digits | the number of digits for printing the "VarImp" table |
Linear Models:
For linear models there's a fine package relaimpo available on CRAN containing several interesting approaches for quantifying the variable importance. See the original documentation.
rpart, Random Forest:
VarImp.rpart and VarImp.randomForest are wrappers around the importance functions from the rpart and randomForest packages, respectively.
C5.0:
C5.0 measures predictor importance by determining the percentage of training set samples that fall into all the terminal nodes after the split. For example, the predictor in the first split automatically has an importance measurement of 100 percent, since all samples are affected by this split. Other predictors may be used frequently in splits, but if the terminal nodes cover only a handful of training set samples, the importance scores may be close to zero. The same strategy is applied to rule-based models and boosted versions of the model. The underlying function can also return the number of times each predictor was involved in a split by using the option metric="usage".
Neural Networks:
The method used here is "Garson weights".
SVM, GLM, Multinom:
There are no implementations for these models so far.
A data frame with class c("VarImp.train", "data.frame") for VarImp.train, or a matrix for other models.
Andri Signorell <[email protected]>
Quinlan, J. (1992). Learning with continuous classes. Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, 343-348.
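A short usage sketch (assuming d.pima as in the other examples on this page):

r.rf <- FitMod(diabetes ~ ., d.pima, fitfn="randomForest")
VarImp(r.rf)
plot(VarImp(r.rf), maxrows=5)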