This is a convenience wrapper around pense_cv() and regmest_cv(), for the common use-case of computing a highly-robust S-estimate followed by a more efficient M-estimate using the scale of the residuals from the S-estimate.

pensem_cv(x, ...)

# S3 method for default
pensem_cv(
  x,
  y,
  alpha = 0.5,
  nlambda = 50,
  lambda_min_ratio,
  lambda_m,
  lambda_s,
  standardize = TRUE,
  penalty_loadings,
  intercept = TRUE,
  bdp = 0.25,
  ncores = 1,
  sparse = FALSE,
  eps = 1e-06,
  cc = 4.7,
  cv_k = 5,
  cv_repl = 1,
  cl = NULL,
  cv_metric = c("tau_size", "mape", "rmspe"),
  add_zero_based = TRUE,
  explore_solutions = 10,
  explore_tol = 0.1,
  max_solutions = 10,
  fit_all = TRUE,
  comparison_tol = sqrt(eps),
  algorithm_opts = mm_algorithm_options(),
  mscale_opts = mscale_algorithm_options(),
  nlambda_enpy = 10,
  enpy_opts = enpy_options(),
  ...
)

# S3 method for pense_cvfit
pensem_cv(
  x,
  scale,
  alpha,
  nlambda = 50,
  lambda_min_ratio,
  lambda_m,
  standardize = TRUE,
  penalty_loadings,
  intercept = TRUE,
  bdp = 0.25,
  ncores = 1,
  sparse = FALSE,
  eps = 1e-06,
  cc = 4.7,
  cv_k = 5,
  cv_repl = 1,
  cl = NULL,
  cv_metric = c("tau_size", "mape", "rmspe"),
  add_zero_based = TRUE,
  explore_solutions = 10,
  explore_tol = 0.1,
  max_solutions = 10,
  fit_all = TRUE,
  comparison_tol = sqrt(eps),
  algorithm_opts = mm_algorithm_options(),
  mscale_opts = mscale_algorithm_options(),
  x_train,
  y_train,
  ...
)

Arguments

x

either a numeric matrix of predictor values, or a cross-validated PENSE fit from pense_cv().

...

ignored. See the section on deprecated parameters below.

y

vector of response values of length n. For binary classification, y should be a factor with 2 levels.

alpha

elastic net penalty mixing parameter with \(0 \le \alpha \le 1\). alpha = 1 is the LASSO penalty, and alpha = 0 the Ridge penalty.

nlambda

number of penalization levels.

lambda_min_ratio

Smallest value of the penalization level as a fraction of the largest level (i.e., the smallest value for which all coefficients are zero). The default depends on the sample size relative to the number of variables and alpha. If more observations than variables are available, the default is 1e-3 * alpha, otherwise 1e-2 * alpha.

lambda_m, lambda_s

optional user-supplied sequence of penalization levels for the S- and M-estimates. If given and not NULL, nlambda and lambda_min_ratio are ignored for the respective estimate (S and/or M).

standardize

logical flag to standardize the x variables prior to fitting the PENSE estimates. Coefficients are always returned on the original scale. This can fail for variables with a large proportion of a single value (e.g., zero-inflated data). In this case, either compute with standardize = FALSE or standardize the data manually.

penalty_loadings

a vector of positive penalty loadings (a.k.a. weights) for different penalization of each coefficient. Only allowed for alpha > 0.

intercept

include an intercept in the model.

bdp

desired breakdown point of the estimator, between 0 and 0.5.

ncores

number of CPU cores to use in parallel. By default, only one CPU core is used. May not be supported on your platform, in which case a warning is given.

sparse

use sparse coefficient vectors.

eps

numerical tolerance.

cc

cutoff constant for Tukey's bisquare \(\rho\) function in the M-estimation objective function.

cv_k

number of folds per cross-validation.

cv_repl

number of cross-validation replications.

cl

a parallel cluster. Can only be used if ncores = 1, because multi-threading can not be used in parallel R sessions on the same host.

cv_metric

either a string specifying the performance metric to use, or a function to evaluate prediction errors in a single CV replication. If a function, the number of arguments define the data the function receives. If the function takes a single argument, it is called with a single numeric vector of prediction errors. If the function takes two or more arguments, it is called with the predicted values as first argument and the true values as second argument. The function must always return a single numeric value quantifying the prediction performance. The order of the given values corresponds to the order in the input data.

add_zero_based

also consider the 0-based regularization path. See details for a description.

explore_solutions

number of solutions to compute up to the desired precision eps.

explore_tol

numerical tolerance for exploring possible solutions. Should be (much) looser than eps to be useful.

max_solutions

only retain up to max_solutions unique solutions per penalization level.

fit_all

If TRUE, fit the model for all penalization levels. Otherwise, only at penalization level with smallest average CV performance.

comparison_tol

numeric tolerance to determine if two solutions are equal. The comparison is first done on the absolute difference in the value of the objective function at the solution If this is less than comparison_tol, two solutions are deemed equal if the squared difference of the intercepts is less than comparison_tol and the squared \(L_2\) norm of the difference vector is less than comparison_tol.

algorithm_opts

options for the MM algorithm to compute the estimates. See mm_algorithm_options() for details.

mscale_opts

options for the M-scale estimation. See mscale_algorithm_options() for details.

nlambda_enpy

number of penalization levels where the EN-PY initial estimate is computed.

enpy_opts

options for the ENPY initial estimates, created with the enpy_options() function. See enpy_initial_estimates() for details.

scale

initial scale estimate to use in the M-estimation. By default the S-scale from the PENSE fit is used.

x_train, y_train

override arguments x and y as provided in the call to pense_cv(). This is useful if the arguments in the pense_cv() call are not available in the current environment.

Value

an object of cross-validated regularized M-estimates as returned from regmest_cv().

Details

The built-in CV metrics are

"tau_size"

\(\tau\)-size of the prediction error, computed by tau_size() (default).

"mape"

Median absolute prediction error.

"rmspe"

Root mean squared prediction error.

"auroc"

Area under the receiver operator characteristic curve (actually 1 - AUROC). Only sensible for binary responses.

See also

pense_cv() to compute the starting S-estimate.

Other functions to compute robust estimates with CV: pense_cv(), regmest_cv()