Structure of problem instance

The package is organized as follows:

The main class is classo_problem. It contains all the information about the problem and, once the problem is solved, it also contains the solution.

Here is the global structure of the problem instance:

A classo_problem instance contains a Data instance, a Formulation instance, a Model_selection instance and a Solution instance.

A Model_selection instance contains the instances: PATHparameters, CVparameters, StabSelparameters, LAMfixedparameters.

A Solution instance, once it is computed, contains the instances: solution_PATH, solution_CV, solution_StabSel, solution_LAMfixed.

Classes

classo_problem(X, y[, C, Tree, label])

Class that contains all the information about the problem.

classo_problem.solve()

Method that solves every model required by the attributes of the problem instance and updates the attribute solution with the characteristics of the solution.

Data(X, y, C[, Tree, label])

Class that contains the data of the problem, i.e. where the matrices and labels are stored.

Formulation()

Class that contains the information about the formulation of the problem, namely the type of formulation (R1, R2, R3, R4, C1, C2) and its parameters, such as rho, the weights and the presence of an intercept.

Model_selection([method])

Class that contains information about the model selections to perform.

PATHparameters([method])

Class that contains the parameters to compute the lasso-path.

CVparameters([method])

Class that contains the parameters to compute the cross-validation.

StabSelparameters([method])

Class that contains the parameters to compute the stability selection.

LAMfixedparameters([method])

Class that contains the parameters to compute the lasso for a fixed lambda.

Solution()

Class that contains the characteristics of the solution of the model selections that are computed. Before the method solve() is used, its components are empty/null.

solution_PATH(matrices, param, formulation, …)

Class that contains the characteristics of the computed lasso-path. It also has a representation method that plots this lasso-path.

solution_ALO(matrices, param, formulation, …)

Class that contains the characteristics of the computed lasso-path. It also has a representation method that plots this lasso-path.

solution_CV(matrices, param, formulation, …)

Class that contains the characteristics of the computed cross-validation. It also has a representation method that plots the selected parameters and the solution of the non-sparse problem on the set of selected variables.

solution_CV.graphic([se_max, save, …])

Method that plots the mean squared error along the lambda path once the cross-validation is computed.

solution_StabSel(matrices, param, …)

Class that contains the characteristics of the computed stability selection. It also has a representation method that plots the selected parameters, the solution of the non-sparse problem on the set of selected variables, and the stability plot.

solution_LAMfixed(matrices, param, …)

Class that contains the characteristics of the computed lasso. It also has a representation method that plots this solution.

Class classo_problem

class classo.solver.classo_problem(X, y, C=None, Tree=None, label=None)

Class that contains all the information about the problem. It also has a representation method so one can print it.

Parameters
  • X (ndarray) – Matrix representing the data of the problem.

  • y (ndarray) – Vector representing the output of the problem.

  • C (str or ndarray, optional) – Matrix of constraints of the problem. If it is ‘zero-sum’, then the corresponding attribute will be the all-ones matrix. Default value : ‘zero-sum’

  • label (list, optional) – List of the labels of each variable. If None, then the labels are just the indices. Default value : None

data

object containing the data (matrices) of the problem. Namely : X, y, C and the labels.

Type

Data

formulation

object containing the info about the formulation of the minimization problem we solve.

Type

Formulation

model_selection

object containing the parameters we need to do variable selection.

Type

Model_selection

solution

object giving the characteristics of the solution of the model selection that is requested. Before the method solve() is used, its components are empty/null.

Type

Solution

numerical_method

name of the numerical method that is used. It can be : ‘Path-Alg’ (path algorithm), ‘P-PDS’ (Projected primal-dual splitting method), ‘PF-PDS’ (Projection-free primal-dual splitting method) or ‘DR’ (Douglas-Rachford-type splitting method). Default value : ‘not specified’, which means that the function choose_numerical_method() will choose it according to the formulation.

Type

str

classo_problem.solve()

Method that solves every model required by the attributes of the problem instance and updates the attribute solution with the characteristics of the solution.

classo.solver.choose_numerical_method(method, model, formulation, StabSelmethod=None, lam=None)

Auxiliary function that chooses the right numerical method if the given one is invalid. In general, it chooses one of the possible optimization schemes for the given formulation. When several computation modes are possible, the rules are as follows:

If possible, always use “Path-Alg”, except for fixed lambdas smaller than 0.05 and for R4, where (paradoxically) Path-Alg does not compute the path.

Otherwise, use “DR”.

Parameters
  • method (str) – input method that is possibly wrong and should be changed. If the method is valid for this formulation, it will not be changed.

  • model (str) – Computation mode. Can be “PATH”, “StabSel”, “CV” or “LAM”.

  • formulation (Formulation) – object containing the info about the formulation of the minimization problem we solve.

  • StabSelmethod (str, optional) – if model is “StabSel”, it can be “first” , “lam” or “max”.

  • lam (float, optional) – value of lam (fractional L1 penalty).

Returns :

str : method that should be used. Can be “Path-Alg”, “DR”, “P-PDS” or “PF-PDS”
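The fallback rule described above can be sketched in a few lines. This is a simplified illustration and not the code of choose_numerical_method(): the function pick_method, its valid_methods argument (the set of methods accepted for the formulation, which this section does not list) and the exact reading of the two exceptions are assumptions made for the example.

```python
# Simplified sketch of the fallback rule stated above; this is not classo's code.
# `valid_methods` is supplied by the caller, because the methods available for
# each formulation are not listed in this section.
def pick_method(method, valid_methods, model, formulation_name,
                stabsel_method=None, lam=None):
    # A method that is valid for the formulation is returned unchanged.
    if method in valid_methods:
        return method

    # Assumption: "fixed lambda" covers the LAM mode and StabSel with method "lam".
    fixed_lambda = model == "LAM" or (model == "StabSel" and stabsel_method == "lam")
    # Exception 1: a fixed lambda smaller than 0.05.
    small_lambda = fixed_lambda and lam is not None and lam < 0.05
    # Exception 2 (assumed reading): R4 whenever a path has to be computed.
    r4_path = formulation_name == "R4" and not fixed_lambda

    if "Path-Alg" in valid_methods and not small_lambda and not r4_path:
        return "Path-Alg"
    return "DR"


print(pick_method("not specified", {"Path-Alg", "DR"}, "PATH", "R3"))  # Path-Alg
```

In practice the method is left at ‘not specified’ and the package makes this choice itself; the sketch only makes the priority order explicit.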

Class Data

class classo.solver.Data(X, y, C, Tree=None, label=None)

Class that contains the data of the problem, i.e. where the matrices and labels are stored.

Parameters
  • X (ndarray) – Matrix representing the data of the problem.

  • y (ndarray) – Vector representing the output of the problem.

  • C (str or array, optional) – Matrix of constraints of the problem. If it is ‘zero-sum’, then the corresponding attribute will be the all-ones matrix.

  • label (list, optional) – List of the labels of each variable. If None, then the labels are just the indices. Default value : None

  • Tree (skbio.TreeNode, optional) – taxonomic tree, if not None, then the matrices X and C and the labels will be changed.

X

Matrix representing the data of the problem.

Type

ndarray

y

Vector representing the output of the problem.

Type

ndarray

C

Matrix of constraints of the problem. If it is ‘zero-sum’, then the corresponding attribute will be the all-ones matrix.

Type

str or array, optional

label

list of the labels of each variable. If None, then the labels are just the indices.

Type

list

tree

taxonomic tree.

Type

skbio.TreeNode or None

Class Formulation

class classo.solver.Formulation

Class that contains the information about the formulation of the problem, namely the type of formulation (R1, R2, R3, R4, C1, C2) and its parameters, such as rho, the weights and the presence of an intercept. The type of formulation is encoded by the booleans huber, concomitant and classification (in this order) with the rule:

False False False = R1

True False False = R2

False True False = R3

True True False = R4

False False True = C1

True False True = C2

It also has a representation method so one can print it.
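As a minimal sketch of the encoding above, the three booleans can be mapped to the formulation name with a lookup table. The helper formulation_name is hypothetical and not part of classo; it only restates the table, together with the documented fact that a classification problem is never concomitant.

```python
# Hypothetical helper (not part of classo): restates the table above.
# Keys are (huber, concomitant, classification).
FORMULATION_NAMES = {
    (False, False, False): "R1",
    (True, False, False): "R2",
    (False, True, False): "R3",
    (True, True, False): "R4",
    (False, False, True): "C1",
    (True, False, True): "C2",
}


def formulation_name(huber, concomitant, classification):
    """Return the formulation name encoded by the three booleans."""
    if classification:
        # A classification formulation is documented as never being concomitant.
        concomitant = False
    return FORMULATION_NAMES[(huber, concomitant, classification)]


print(formulation_name(huber=True, concomitant=True, classification=False))  # R4
```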

huber

True if the formulation of the problem should be robust. Default value : False

Type

bool

concomitant

True if the formulation of the problem should be with an M-estimation of sigma. Default value : True

Type

bool

classification

True if the formulation of the problem should be classification (if yes, then it will not be concomitant). Default value : False

Type

bool

rho

Value of rho for R2 and R4 formulations. Default value : 1.345

Type

float

scale_rho

If set to True, rho becomes rho * sqrt( mean( y**2 ) ) while the problem is solved, so that it lives on the scale of y. This is also useful to avoid the problem of non-strict convexity: as long as rho is higher than one, at least one sample is in the quadratic mode of the Huber loss function. Default value : True

Type

bool
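The rescaling applied when scale_rho is True can be written out as follows. This is a plain-Python illustration of the formula rho * sqrt( mean( y**2 ) ) given above; the function name scaled_rho is an assumption made for the example and is not a classo function.

```python
import math


# Hypothetical helper (not part of classo) illustrating the scale_rho formula.
def scaled_rho(rho, y):
    """Return rho * sqrt(mean(y**2)), so that rho lives on the scale of y."""
    mean_square = sum(value * value for value in y) / len(y)
    return rho * math.sqrt(mean_square)


# Example with the default rho and a small output vector.
print(scaled_rho(1.345, [3.0, -4.0, 0.0, 5.0]))
```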

rho_scaled

Actual rho after solving. Default value : Not defined

Type

float

rho_classification

value of rho for the huberized hinge loss function used in classification, i.e. C2 (it has to be strictly smaller than 1). Default value : -1.

Type

float

e

value of e in concomitant formulation. If ‘n/2’ then it becomes n/2 during the method solve(), same for ‘n’. Default value : ‘n’ if huber formulation ; ‘n/2’ else

Type

float or string

w

array of size d with the weights of the L1 penalization. This has to be positive. Default value : None (which makes it the 1,…,1 vector)

Type

numpy ndarray

intercept

set to true if we should use an intercept. Default value : False

Type

bool

Class Model_selection

class classo.solver.Model_selection(method='not specified')

Class that contains information about the model selections to perform. It contains booleans that state which ones will be computed. It also contains objects that hold the parameters of each computation mode. It also has a representation method so one can print it.

PATH

True if path should be computed. Default value : False

Type

bool

PATHparameters

object containing parameters to compute the lasso-path.

Type

PATHparameters

ALO

True if the ALO (approximation of the leave-one-out error) should be computed. Default value : False

Type

bool

ALOparameters

object containing parameters to compute the ALO for c-lasso.

Type

ALOparameters

CV

True if Cross Validation should be computed. Default value : False

Type

bool

CVparameters

object containing parameters to compute the cross-validation.

Type

CVparameters

StabSel

True if Stability Selection should be computed. Default value : True

Type

bool

StabSelparameters

object containing parameters to compute the stability selection.

Type

StabSelparameters

LAMfixed

True if solution for a fixed lambda should be computed. Default value : False

Type

bool

LAMfixedparameters

object containing parameters to compute the lasso for a fixed lambda.

Type

LAMfixedparameters

Classes used in Model_selection

class classo.solver.PATHparameters(method='not specified')

Class that contains the parameters to compute the lasso-path. It also has a representation method so one can print it.

numerical_method

name of the numerical method that is used. It can be : ‘Path-Alg’ (path algorithm), ‘P-PDS’ (Projected primal-dual splitting method), ‘PF-PDS’ (Projection-free primal-dual splitting method) or ‘DR’ (Douglas-Rachford-type splitting method). Default value : ‘not specified’, which means that the function choose_numerical_method() will choose it according to the formulation.

Type

str

n_active

if it is higher than 0, then the algorithm stops computing the path when n_active variables are active. Then the solution does not change from this point. Default value : 0

Type

int

lambdas

list of rescaled lambdas for computing the lasso-path. Default value : None, which means a grid of Nlam points between 1 and lamin, spaced on a logarithmic scale or not depending on logscale.

Type

numpy.ndarray

Nlam

number of points in the lambda-path if lambdas is still None (default). Default value : 80

Type

int

lamin

lambda minimum if lambdas is still None (default). Default value : 1e-3

Type

float

logscale

when lambdas is set to None (default), this parameter tells whether it should be set with a log scale or not. Default value : True

Type

bool
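When lambdas is None, the grid is built from Nlam, lamin and logscale. The sketch below shows one way such a grid could be generated in plain Python. The function default_lambdas is hypothetical, and the choice to include both endpoints (1 and lamin) is an assumption; the grid actually built by classo may differ in its details.

```python
# Hypothetical helper (not part of classo): builds a grid of rescaled lambdas.
def default_lambdas(Nlam=80, lamin=1e-3, logscale=True):
    """Return Nlam rescaled lambdas decreasing from 1 to lamin."""
    if Nlam < 2:
        raise ValueError("Nlam must be at least 2")
    # Fractions going from 0 (lambda = 1) to 1 (lambda = lamin).
    steps = [i / (Nlam - 1) for i in range(Nlam)]
    if logscale:
        # Geometric spacing: lamin**0 = 1 down to lamin**1 = lamin.
        return [lamin ** t for t in steps]
    # Linear spacing between 1 and lamin.
    return [1.0 + t * (lamin - 1.0) for t in steps]


grid = default_lambdas(Nlam=5, lamin=1e-2)
print(grid)
```

A logarithmic grid places more points near lamin, which is usually where the path changes fastest.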

plot_sigma

if True then the representation method of the solution will also plot the sigma-path if it is computed (formulation R3 or R4). Default value : True

Type

bool

label

labels on each coefficient.

Type

numpy.ndarray of str

class classo.solver.ALOparameters(method='not specified')

Class that contains the parameters to compute the lasso-path, then the approximation of the leave-one-out error (ALO). It also has a representation method so one can print it.

numerical_method

name of the numerical method that is used. It can be : ‘Path-Alg’ (path algorithm), ‘P-PDS’ (Projected primal-dual splitting method), ‘PF-PDS’ (Projection-free primal-dual splitting method) or ‘DR’ (Douglas-Rachford-type splitting method). Default value : ‘not specified’, which means that the function choose_numerical_method() will choose it according to the formulation.

Type

str

n_active

if it is higher than 0, then the algorithm stops computing the path when n_active variables are active. Then the solution does not change from this point. Default value : 0

Type

int

lambdas

list of rescaled lambdas for computing the lasso-path. Default value : None, which means a grid of Nlam points between 1 and lamin, spaced on a logarithmic scale or not depending on logscale.

Type

numpy.ndarray

Nlam

number of points in the lambda-path if lambdas is still None (default). Default value : 80

Type

int

lamin

lambda minimum if lambdas is still None (default). Default value : 1e-3

Type

float

logscale

when lambdas is set to None (default), this parameter tells whether it should be set with a log scale or not. Default value : True

Type

bool

plot_sigma

if True then the representation method of the solution will also plot the sigma-path if it is computed (formulation R3 or R4). Default value : True

Type

bool

label

labels on each coefficient.

Type

numpy.ndarray of str

class classo.solver.CVparameters(method='not specified')

Class that contains the parameters to compute the cross-validation. It also has a representation method so one can print it.

seed

Seed for random values; for an equal seed, the result will be the same. If set to False/None: pseudo-random seed. Default value : 0

Type

bool or int, optional

numerical_method

name of the numerical method that is used. It can be : ‘Path-Alg’ (path algorithm), ‘P-PDS’ (Projected primal-dual splitting method), ‘PF-PDS’ (Projection-free primal-dual splitting method) or ‘DR’ (Douglas-Rachford-type splitting method). Default value : ‘not specified’, which means that the function choose_numerical_method() will choose it according to the formulation.

Type

str

lambdas

list of rescaled lambdas for computing the lasso-path. Default value : None, which means a grid of Nlam points between 1 and lamin, spaced on a logarithmic scale or not depending on logscale.

Type

numpy.ndarray

Nlam

number of points in the lambda-path if lambdas is still None (default). Default value : 80

Type

int

lamin

lambda minimum if lambdas is still None (default). Default value : 1e-3

Type

float

logscale

when lambdas is set to None (default), this parameter tells whether it should be set with a log scale or not. Default value : True

Type

bool

oneSE

if set to True, the selected lambda is computed with method ‘one-standard-error’. Default value : True

Type

bool

Nsubset

number of subsets in the cross-validation method. Default value : 5

Type

int
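To illustrate how seed and Nsubset interact, the sketch below splits n sample indices into Nsubset reproducible folds. It is a generic cross-validation split written for this example; the function make_folds is hypothetical, and the way classo actually shuffles and partitions the samples is not described in this section.

```python
import random


# Hypothetical helper (not part of classo): reproducible K-fold split.
def make_folds(n, Nsubset=5, seed=0):
    """Split the indices 0..n-1 into Nsubset disjoint folds."""
    # Assumption: None/False means "no fixed seed"; any integer fixes the split.
    if seed is None or seed is False:
        rng = random.Random()
    else:
        rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin into the folds.
    return [indices[k::Nsubset] for k in range(Nsubset)]


folds = make_folds(n=10, Nsubset=5, seed=0)
print(folds)
```

Each fold is used once as the validation set while the remaining folds are used for fitting.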

class classo.solver.StabSelparameters(method='not specified')

Class that contains the parameters to compute the stability selection. It also has a representation method so one can print it.

seed

Seed for random values; for an equal seed, the result will be the same. If set to False/None: pseudo-random seed. Default value : 123

Type

bool or int, optional

numerical_method

name of the numerical method that is used. It can be : ‘Path-Alg’ (path algorithm), ‘P-PDS’ (Projected primal-dual splitting method), ‘PF-PDS’ (Projection-free primal-dual splitting method) or ‘DR’ (Douglas-Rachford-type splitting method). Default value : ‘not specified’, which means that the function choose_numerical_method() will choose it according to the formulation.

Type

str

lam

(only used if method = ‘lam’) lam for which the lasso should be computed. Default value : ‘theoretical’, which means it will be equal to theoretical_lam once it is computed.

Type

float or str

rescaled_lam

(only used if method = ‘lam’) False if lam = lambda, True if lam = lambda/lambdamax, which is between 0 and 1. If False and lam = ‘theoretical’, then it will take the value n*theoretical_lam. Default value : True

Type

bool

theoretical_lam

(only used if method = ‘lam’) Theoretical lam. Default value : 0.0 (as long as it has not been computed; it is computed by the function theoretical_lam() used in classo_problem.solve()).

Type

float

method

‘first’, ‘lam’ or ‘max’ depending on the type of stability selection we do. Default value : ‘first’

Type

str

B

number of subsamples considered. Default value : 50

Type

int

q

number of selected variables per subsample. Default value : 10

Type

int

percent_nS

size of a subsample relative to the total number of samples. Default value : 0.5

Type

float

lamin

lamin when computing the lasso-path for method ‘max’. Default value : 1e-2

Type

float

hd

if set to True, then the ‘max’ method will stop when it reaches n-k active variables. Default value : False

Type

bool

threshold

threshold for stability selection. Default value : 0.7

Type

float

threshold_label

threshold used to decide when the label should be plotted on the graph. Default value : 0.4

Type

float
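The parameters B, percent_nS, threshold and threshold_label fit together as in the generic stability-selection sketch below. It is an illustration under stated assumptions, not classo's implementation: the functions stability_ratios and apply_thresholds are hypothetical, the select argument stands in for whatever procedure picks variables on a subsample, and the strict comparison against the thresholds is an assumption.

```python
import random


# Hypothetical helpers (not part of classo) illustrating stability selection.
def stability_ratios(select, n, d, B=50, percent_nS=0.5, seed=123):
    """Fraction of the B subsamples in which each of the d variables is selected.

    `select` is a caller-supplied function mapping a list of sample indices
    to the set of selected variable indices.
    """
    if seed is None or seed is False:
        rng = random.Random()
    else:
        rng = random.Random(seed)
    subsample_size = int(percent_nS * n)
    counts = [0] * d
    for _ in range(B):
        # Draw a subsample without replacement.
        subsample = rng.sample(range(n), subsample_size)
        for j in select(subsample):
            counts[j] += 1
    return [count / B for count in counts]


def apply_thresholds(ratios, threshold=0.7, threshold_label=0.4):
    """Return (selected, to_label) boolean lists; strict '>' is an assumption."""
    selected = [ratio > threshold for ratio in ratios]
    to_label = [ratio > threshold_label for ratio in ratios]
    return selected, to_label


# Toy selector (illustrative only): always selects variables 0 and 2.
ratios = stability_ratios(lambda subsample: {0, 2}, n=20, d=3, B=10)
print(ratios)  # [1.0, 0.0, 1.0]
print(apply_thresholds(ratios))
```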

class classo.solver.LAMfixedparameters(method='not specified')

Class that contains the parameters to compute the lasso for a fixed lambda. It also has a representation method so one can print it.

numerical_method

name of the numerical method that is used. It can be : ‘Path-Alg’ (path algorithm), ‘P-PDS’ (Projected primal-dual splitting method), ‘PF-PDS’ (Projection-free primal-dual splitting method) or ‘DR’ (Douglas-Rachford-type splitting method). Default value : ‘not specified’, which means that the function choose_numerical_method() will choose it according to the formulation.

Type

str

lam

lam for which the lasso should be computed. Default value : ‘theoretical’, which means it will be equal to theoretical_lam once it is computed.

Type

float or str

rescaled_lam

False if lam = lambda, True if lam = lambda/lambdamax, which is between 0 and 1. If False and lam = ‘theoretical’, then it will take the value n*theoretical_lam. Default value : True

Type

bool
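The interplay between lam, rescaled_lam and theoretical_lam can be summarized in a short sketch. The two functions below are hypothetical helpers written for this example; they only restate the rules given above (lam is a fraction of lambdamax when rescaled_lam is True, and ‘theoretical’ resolves to theoretical_lam, or to n*theoretical_lam when rescaled_lam is False).

```python
# Hypothetical helpers (not part of classo) restating the rules above.
def resolve_lam(lam, rescaled_lam, theoretical_lam, n):
    """Replace the placeholder 'theoretical' by its numerical value."""
    if lam == "theoretical":
        return theoretical_lam if rescaled_lam else n * theoretical_lam
    return lam


def effective_lambda(lam, lambdamax, rescaled_lam=True):
    """Return the actual lambda used in the penalty."""
    if rescaled_lam:
        # lam is expressed as a fraction of lambdamax, between 0 and 1.
        return lam * lambdamax
    return lam


lam = resolve_lam("theoretical", rescaled_lam=True, theoretical_lam=0.02, n=100)
print(effective_lambda(lam, lambdamax=20.0))
```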

theoretical_lam

Theoretical lam. Default value : 0.0 (as long as it has not been computed; it is computed by the function theoretical_lam() used in classo_problem.solve()).

Type

float

threshold

Threshold such that the selected parameters i are the ones for which the absolute value of beta[i] is greater than the threshold. If None, then it will be set to the average of the absolute value of beta. Default value : None

Type

float
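The thresholding rule above amounts to a few lines of code. The helper select_by_threshold is hypothetical and only illustrates the stated rule, including the default of using the average absolute value of beta when the threshold is None.

```python
# Hypothetical helper (not part of classo) illustrating the threshold rule.
def select_by_threshold(beta, threshold=None):
    """Return a boolean list, True where |beta[i]| exceeds the threshold."""
    magnitudes = [abs(b) for b in beta]
    if threshold is None:
        # Default: the average of the absolute values of beta.
        threshold = sum(magnitudes) / len(magnitudes)
    return [m > threshold for m in magnitudes]


print(select_by_threshold([0.0, -2.0, 0.5, 1.5]))  # [False, True, False, True]
```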

Class Solution

class classo.solver.Solution

Class that contains the characteristics of the solution of the model selections that are computed. Before the method solve() is used, its components are empty/null. It also has a representation method so one can print it.

PATH

Solution components of the model PATH.

Type

solution_PATH

CV

Solution components of the model CV.

Type

solution_CV

StabSel

Solution components of the model StabSel.

Type

solution_StabSel

LAMfixed

Solution components of the model LAMfixed.

Type

solution_LAMfixed

Classes used in Solution

class classo.solver.solution_PATH(matrices, param, formulation, numerical_method, label)

Class that contains the characteristics of the computed lasso-path. It also has a representation method that plots this lasso-path.

BETAS

array of size Npath x d with the solution beta for each lambda on each row.

Type

numpy.ndarray

SIGMAS

array of size Npath with the solution sigma for each lambda when the formulation of the problem is R3 or R4.

Type

numpy.ndarray

LAMBDAS

array of size Npath with the lambdas (real lambdas, not divided by lambda_max) for which the solution is computed.

Type

numpy.ndarray

logscale

whether or not the path should be plotted with a logscale.

Type

bool

method

name of the numerical method that has been used. It can be ‘Path-Alg’, ‘P-PDS’ , ‘PF-PDS’ or ‘DR’.

Type

str

save

if it is a str, then it gives the name of the file where the graphic has been/will be saved (after using print(solution) ).

Type

bool or str

formulation

object containing the info about the formulation of the minimization problem we solve.

Type

Formulation

time

running time of this action.

Type

float

class classo.solver.solution_ALO(matrices, param, formulation, numerical_method, label)

Class that contains the characteristics of the computed lasso-path. It also has a representation method that plots this lasso-path.

BETAS

array of size Npath x d with the solution beta for each lambda on each row.

Type

numpy.ndarray

SIGMAS

array of size Npath with the solution sigma for each lambda when the formulation of the problem is R3 or R4.

Type

numpy.ndarray

LAMBDAS

array of size Npath with the lambdas (real lambdas, not divided by lambda_max) for which the solution is computed.

Type

numpy.ndarray

logscale

whether or not the path should be plotted with a logscale.

Type

bool

method

name of the numerical method that has been used. It can be ‘Path-Alg’, ‘P-PDS’ , ‘PF-PDS’ or ‘DR’.

Type

str

save1,save2

if a string is given, the corresponding graph will be saved with the given name of the file. save1 is for the path plot ; save2 for ALO plot ; and save3 for refit beta-solution.

Type

bool or string

formulation

object containing the info about the formulation of the minimization problem we solve.

Type

Formulation

time

running time of this action.

Type

float

class classo.solver.solution_CV(matrices, param, formulation, numerical_method, label)

Class that contains the characteristics of the computed cross-validation. It also has a representation method that plots the selected parameters and the solution of the non-sparse problem on the set of selected variables.

xGraph

array of size Nlam of the lambdas / lambda_max.

Type

numpy.ndarray

yGraph

array of size Nlam of the average validation residual (over the K subsets).

Type

numpy.ndarray

standard_error

array of size Nlam of the standard error of the validation residual (over the K subsets).

Type

numpy.ndarray

logscale

whether or not the path should be plotted with a logscale.

Type

bool

index_min

index on xGraph of the selected lambda without 1-standard-error method.

Type

int

index_1SE

index on xGraph of the selected lambda with 1-standard-error method.

Type

int

lambda_min

selected lambda without 1-standard-error method.

Type

float

lambda_oneSE

selected lambda with 1-standard-error method.

Type

float
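The relation between index_min and index_1SE can be illustrated with the usual one-standard-error rule: among all lambdas whose average error is within one standard error of the minimum, take the largest lambda. The sketch below implements this standard rule in plain Python; the function select_lambdas is hypothetical, and the exact tie-breaking used by classo is not confirmed by this section.

```python
# Hypothetical helper (not part of classo): the standard one-standard-error rule.
def select_lambdas(lambdas, mean_error, standard_error):
    """Return (index_min, index_1SE) for a cross-validation curve."""
    all_indices = range(len(lambdas))
    # Index of the smallest average validation error.
    index_min = min(all_indices, key=lambda i: mean_error[i])
    bound = mean_error[index_min] + standard_error[index_min]
    # Among lambdas within one standard error of the minimum,
    # keep the largest one, which gives the sparsest model.
    candidates = [i for i in all_indices if mean_error[i] <= bound]
    index_1se = max(candidates, key=lambda i: lambdas[i])
    return index_min, index_1se


lambdas = [1.0, 0.5, 0.25, 0.125]
mean_error = [4.0, 2.2, 2.0, 2.1]
standard_error = [0.3, 0.3, 0.3, 0.3]
print(select_lambdas(lambdas, mean_error, standard_error))  # (2, 1)
```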

beta

solution beta of classo at lambda_oneSE/lambda_min depending on CVparameters.oneSE.

Type

numpy.ndarray

sigma

solution sigma of classo at lambda_oneSE when the formulation is ‘R3’ or ‘R4’.

Type

float

selected_param

boolean array of size d with True when the variable is selected.

Type

numpy.ndarray

refit

solution beta after solving the non-sparse problem over the set of selected variables.

Type

numpy.ndarray

save1,save2

if a string is given, the corresponding graph will be saved with the given name of the file. save1 is for CV curve ; and save2 for refit beta-solution.

Type

bool or string

formulation

object containing the info about the formulation of the minimization problem we solve.

Type

Formulation

time

running time of this action.

Type

float

solution_CV.graphic(se_max=None, save=None, logscale=True, errorevery=5)

Method that plots the mean squared error along the lambda path once the cross-validation is computed.

Parameters
  • se_max (float) – if given, the graphic will not show the lambdas for which MSE(lambda) > min(MSE) + se_max * Standard_error(lambda_min). This parameter is useful to plot a graph that zooms in on the interesting part. Default value : None

  • logscale (bool) – tells whether to plot the mean squared error as a function of lambda or of log10(lambda). Default value : True

  • errorevery (int) – parameter passed to matplotlib.pyplot.errorbar that gives the frequency at which error bars appear. Default value : 5

  • save (string) – path to the file where the figure should be saved. If None, then the figure will not be saved. Default value : None
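The effect of se_max can be sketched as a filter on the points of the cross-validation curve. The function visible_points is hypothetical and applies the criterion point by point, which is an assumed reading of the description of se_max; no plotting is done here.

```python
# Hypothetical helper (not part of classo): which points survive the se_max cut.
def visible_points(lambdas, mse, standard_error, se_max=None):
    """Return the indices of the lambdas that would be kept on the graphic."""
    if se_max is None:
        # No zoom: every lambda is shown.
        return list(range(len(lambdas)))
    index_min = min(range(len(mse)), key=lambda i: mse[i])
    cutoff = mse[index_min] + se_max * standard_error[index_min]
    # Drop the lambdas whose error exceeds the cutoff.
    return [i for i in range(len(lambdas)) if mse[i] <= cutoff]


lambdas = [1.0, 0.5, 0.25, 0.125]
mse = [10.0, 3.0, 2.0, 2.5]
standard_error = [1.0, 0.5, 0.5, 0.5]
print(visible_points(lambdas, mse, standard_error, se_max=2))  # [1, 2, 3]
```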

class classo.solver.solution_StabSel(matrices, param, formulation, numerical_method, label)

Class that contains the characteristics of the computed stability selection. It also has a representation method that plots the selected parameters, the solution of the non-sparse problem on the set of selected variables, and the stability plot.

distribution

array of size d of stability ratios.

Type

array

lambdas_path

for the ‘first’ method : array of size Nlam of the lambdas used. Other cases : ‘not used’.

Type

array or string

distribution_path

for the ‘first’ method : array of size Nlam x d with the stability ratios as a function of lambda.

Type

array or string

threshold

threshold for StabSel, i.e. for a variable i, the stability ratio that is needed to get selected.

Type

float

save1,save2

if a string is given, the corresponding graph will be saved with the given name of the file. save1 is for stability plot ; and save2 for refit beta-solution.

Type

bool or string

selected_param

boolean array of size d with True when the variable is selected.

Type

numpy.ndarray

to_label

boolean array of size d with True when the name of the variable should be seen on the graph.

Type

numpy.ndarray

refit

solution beta after solving the non-sparse problem over the set of selected variables.

Type

numpy.ndarray

formulation

object containing the info about the formulation of the minimization problem we solve.

Type

Formulation

time

running time of this action.

Type

float

class classo.solver.solution_LAMfixed(matrices, param, formulation, numerical_method, label)

Class that contains the characteristics of the computed lasso. It also has a representation method that plots this solution.

lambdamax

lambda maximum for which the solution is non-null.

Type

float

rescaled_lam

if True, the problem has been solved for lambda*lambdamax (so lambda should be between 0 and 1).

Type

bool

lamb

lambda for which the problem is solved.

Type

float

beta

solution beta of classo.

Type

numpy.ndarray

sigma

solution sigma of classo when the formulation is ‘R3’ or ‘R4’.

Type

float

selected_param

boolean array of size d with True when the variable is selected (which is the case when the i-th component of the classo solution is non-null).

Type

numpy.ndarray

refit

solution beta after solving the non-sparse problem over the set of selected variables.

Type

numpy.ndarray

formulation

object containing the info about the formulation of the minimization problem we solve.

Type

Formulation

time

running time of this action.

Type

float