Defining the Parameter Configuration Space

This page explains in detail how the types, allowed values, and default values of all parameters are defined. It also covers how dependencies and constraints between parameters can be modeled.

Parameter Definitions

The parameter definitions are stored in a dictionary, where the name of a parameter serves as the key, and a tuple containing all necessary information constitutes the corresponding value. SMAC, the configuration system underlying pySMAC, distinguishes between four different parameter types:

  1. real – take any value in a specified range.
  2. integer – take any integral value in a specified range.
  3. categorical – take one of a finite set of values.
  4. ordinal – are essentially categorical parameters with a natural ordering between the elements.

Let us look at a simple example for each of the four types to show how they are defined:

parameter_definitions=dict(\
a_float_parameter =       ("real"       , [-3.5, 2.48]                             , 1.1           ),
an_integer_parameter =    ("integer"    , [1, 1000]                                ,  2      ,"log"),
a_categorical_parameter = ("categorical", ["yes", "no", "maybe"]                   , "yes"         ),
an_ordinal_parameter =    ("ordinal"    , ["cold", "cool", "medium", "warm", "hot"], "medium"      )
)

The definition of each parameter follows the same pattern: first its type, followed by a list defining the allowed values, and finally the default value.

For the real and integer types, the allowed values are defined by a range, represented by a list with exactly two elements. The default value has to lie inside this range to yield a legal definition. Both the range bounds and the default of an integer parameter have to be Python ints.

There is an optional flag “log” that can additionally be given as the last element of the tuple for these two types. If given, the parameter is varied on a logarithmic scale, meaning that the logarithm of the value is uniformly distributed between the logarithms of the bounds. Consequently, this option can only be used if the parameter is strictly positive!
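
To illustrate what the log scale means, here is a minimal sketch (plain Python, independent of pySMAC) of how such a value could be drawn:

import math
import random

# illustrative only: draw an integer between 1 and 1000 on a log scale,
# i.e., the logarithm of the value is uniformly distributed between the
# logarithms of the bounds
low, high = 1, 1000
value = int(round(math.exp(random.uniform(math.log(low), math.log(high)))))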

For categorical and ordinal parameters, the list of allowed values can contain any number (>0) of elements. Every element constitutes a valid value for the parameter, and the default value has to be among them. The ordering of an ordinal parameter is established by the order of the list.

The Python type of the values for categorical and ordinal parameters can be a numeric type or a string. The only restriction is that all allowed values for one parameter must be of the same type. For example, the following definition is not valid:

parameter_definitions=dict(\
        a = ("integer"    , [1, 1000] ,  2.0 ),       # default value is not of type int
        b = ("categorical", ["True", "False",1], 1),  # 2 str and one int value
)

Note

Defining the parameter configuration space can be quite challenging, e.g., if the number of parameters is large, so typos and/or inconsistencies can easily happen. SMAC itself checks the definitions in great detail, and pySMAC also provides some sanity checks with (hopefully) helpful error messages to assist in this tedious task.

Conditional Parameter Clauses

In many cases, certain parameters are only meaningful if another one takes a certain value. For example, one parameter might activate or deactivate a subroutine that has parameters itself. Naturally, the latter are only relevant when the subroutine is actually used. These dependencies can be expressed in pySMAC to accelerate the configuration process (by reducing the number of active parameters).

To illustrate this, let’s focus on the example in sklearn_model_selection.py. The script demonstrates a use case from machine learning. Given a data set, there are numerous models that can be used to learn from it and apply this knowledge to unseen data points. In this simple example, we generate a random data set and try to find the model (with the best parameter settings) among only three very basic ones, namely k-nearest-neighbors, random forests, and extremely randomized trees. To run the example, you need the scikit-learn package, but the source code below should be illustrative enough to show how to use conditionals.

from __future__ import print_function, division

import pysmac

import sklearn.ensemble
import sklearn.neighbors
import sklearn.datasets
import sklearn.cross_validation

# We use a random classification data set generated by sklearn
# As commonly done, we use a train-test split to avoid overfitting.
X,Y = sklearn.datasets.make_classification(1000, 20)
X_train, X_test, Y_train, Y_test = \
	sklearn.cross_validation.train_test_split(X,Y, test_size=0.33, random_state=1)

# Here, SMAC can choose between different models at each evaluation. To
# make the search more efficient, it is important to tell SMAC that some
# parameters are associated with certain classifiers.
def choose_classifier(classifier, # which classifier to use
				# parameters for the tree based classifiers
                trees_n_estimators = None, trees_criterion = None, 
                trees_max_features = None, trees_max_depth = None,
                # the ones for k-nearest-neighbors
                knn_n_neighbors=None, knn_weights=None):
    # note that possibly inactive variables have to be optional,
    # as pySMAC does not assign a value to inactive variables
    # during the minimization phase
    if classifier == 'random_forest':
        predictor = sklearn.ensemble.RandomForestClassifier(
						trees_n_estimators, trees_criterion,
						trees_max_features, trees_max_depth)
    elif classifier == 'extra_trees':
        predictor = sklearn.ensemble.ExtraTreesClassifier(
						trees_n_estimators, trees_criterion,
						trees_max_features, trees_max_depth)
    elif classifier == 'k_nearest_neighbors':
        predictor = sklearn.neighbors.KNeighborsClassifier(
						knn_n_neighbors, knn_weights)

    predictor.fit(X_train, Y_train)
    return -predictor.score(X_test, Y_test)

# defining all the parameters with respective defaults.
parameter_definition=dict(\
	trees_max_depth =  ("integer", [1,10],  4),
	trees_max_features=("integer", [1,20], 10),
	trees_n_estimators=("integer", [1,100],10 ,'log'),          
	trees_criterion =("categorical", ['gini', 'entropy'], 'entropy'),
	knn_n_neighbors = ("integer", [1,100], 10, 'log'),
	knn_weights = ("categorical", ['uniform', 'distance'], 'uniform'),
	classifier  = ("ordinal", ['random_forest','extra_trees' ,'k_nearest_neighbors'], 'random_forest'),
	# Usually you would make this a categorical, but to showcase all 
	# conditional clauses, let's pretend it's an ordinal parameter,
	# so we can use > and <.
	)

# here we define the dependencies between the parameters. the notation is
#   <child> | <parent> in { <parent value>, ... }
# and means that the child parameter is only active if the parent parameter
# takes one of the values in the listed set. The notation follows the SMAC
# manual one to one. Note there is no checking for correctness beyond
# what SMAC does. I.e., when you have a typo in here, you don't get any 
# meaningful output, unless you set  debug = True below!
conditionals = [ 'trees_max_depth    | classifier in {random_forest, extra_trees}',
                 'trees_max_features | classifier in {random_forest} || classifier == extra_trees',
                 'trees_n_estimators | classifier != k_nearest_neighbors',
                 'trees_criterion    | classifier < k_nearest_neighbors',
                 'knn_n_neighbors    | classifier > extra_trees',
                 'knn_weights        | classifier == k_nearest_neighbors && classifier != extra_trees && classifier != random_forest'
                ]

# creation of the SMAC_optimizer object. Notice the optional debug flag
opt = pysmac.SMAC_optimizer( debug = 0,
							 working_directory = '/tmp/pysmac_test/', persistent_files=True, )

# first we try the sklearn default, so we can see if SMAC can improve the performance

predictor = sklearn.ensemble.RandomForestClassifier()
predictor.fit(X_train, Y_train)
print('The default accuracy of the random forest is %f'%predictor.score(X_test, Y_test))

predictor = sklearn.ensemble.ExtraTreesClassifier()
predictor.fit(X_train, Y_train)
print('The default accuracy of the extremely randomized trees is %f'%predictor.score(X_test, Y_test))

predictor = sklearn.neighbors.KNeighborsClassifier()
predictor.fit(X_train, Y_train)
print('The default accuracy of k-nearest-neighbors is %f'%predictor.score(X_test, Y_test))


# The minimize method also has optional arguments (more on that in the section on advanced configuration).
value, parameters = opt.minimize(choose_classifier,
                    500 , parameter_definition,
                    conditional_clauses = conditionals)

print('The highest accuracy found: %f'%(-value))
print('Parameter setting %s'%parameters)

The output looks like this (note that the random data set leads to slightly different numbers on every run):

The default accuracy of the random forest is 0.909091
The default accuracy of the extremely randomized trees is 0.903030
The default accuracy of k-nearest-neighbors is 0.863636
The highest accuracy found: 0.936364
Parameter setting {'knn_weights': 'distance', 'trees_n_estimators': '8', 'knn_n_neighbors': '1', 'classifier': 'random_forest', 'trees_max_features': '10', 'trees_max_depth': '2', 'trees_criterion': 'gini'}

The script shows how pySMAC can be used for simultaneous model selection and hyperparameter optimization. The function to be minimized (choose_classifier) returns the negative accuracy after training one of three machine learning models (a random forest, extremely randomized trees, or k-nearest-neighbors). Effectively, SMAC is asked to maximize the accuracy by choosing one of these models together with its respective parameters.

The parameter definitions are collected in the parameter_definition dictionary. Naturally, the ones for k-nearest-neighbors and for the two tree-based classifiers are independent. Therefore, the parameters of a classifier affect the accuracy only if that classifier is actually chosen.

The variable conditionals shows some examples of how these dependencies between parameters are expressed. Generally, they follow the template:

child_name | condition1 (&& or ||) condition2 (&& or ||) ...

The child parameter is only considered active if the logical expression following the “|” is true.

Note

From the SMAC manual
  • Parameters not listed as a child in any conditional parameter clause are always active.
  • A child’s name can appear only once.
  • There is no support for parentheses in conditionals. The && connective has higher precedence than ||, so a || b && c || d is the same as a || (b && c) || d.

The conditions can take different forms:

parent_name in {value1, value2, ... }
parent_name == value
parent_name != value
parent_name <  value
parent_name >  value

The first form is true if the parent takes any of the listed values. The other expressions have their regular meaning. The operators < and > are only legal for real, integer, or ordinal parameters, while the others can be used with any type.
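
For illustration, here are clauses in each of these forms for some hypothetical parameters (the names are made up and would require matching parameter definitions):

conditionals = [ 'beta      | method in {newton, bfgs}',   # set membership
                 'gamma     | use_gamma == yes',           # (in)equality
                 'max_depth | verbosity > low' ]           # ordinal comparison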

Forbidden Parameter Clauses

In some use cases, certain parameter configurations might be illegal or lead to undefined behavior. For example, an algorithm might be able to employ different data structures and different subroutines, controlled by two parameters:

parameter_definition = dict(\
        DS = ("categorical", ["DataStructure1", "DataStructure2", "DataStructure3"], "DataStructure1"),
        SR = ("categorical", ["SubRoutine1", "SubRoutine2", "SubRoutine3"], "SubRoutine1")
)

Let’s assume that DataStructure2 is incompatible with SubRoutine3, i.e., evaluating this combination does not yield a meaningful result, or might even cause a system crash. That means one out of the nine possible choices for these two parameters is forbidden.

One can certainly change the parameters and their definitions such that they exclude this case explicitly. One could, e.g., combine the two parameters and list all eight allowed values:

parameter_definition = dict(\
        DS_SR = ("categorical", ["DS1_SR1", "DS1_SR2", "DS1_SR3", "DS2_SR1", "DS2_SR2", "DS3_SR1", "DS3_SR2", "DS3_SR3"], "DS1_SR1")
)

This is not only impractical, but it also prevents SMAC from learning about the data structures and subroutines independently. It is much more efficient to specifically exclude this one combination by defining a forbidden parameter clause. The classic syntax is as follows:

"{parameter_name1 = value1, ..., parameter_nameN = ValueN}"

It specifies combinations of values that are forbidden. For the example above, the appropriate forbidden clause would be:

forbidden_confs = ["{DS = DataStructure2, SR = SubRoutine3}"]

Note

The pair of curly braces {} around the expression is mandatory. The pySMAC notation here is a direct copy of the SMAC one. These strings are merely handed over to SMAC without any processing. That way, statements from the SMAC manual are applicable to pySMAC as well.

A list of all forbidden clauses is then passed to the minimize method via the forbidden_clauses keyword. The corresponding call to the minimize method would look like this:

opt = pysmac.SMAC_optimizer()
value, config = opt.minimize(function_to_minimize, num_function_calls,
                             parameter_definition,
                             forbidden_clauses = forbidden_confs)

Of course, conditionals and forbidden clauses are not mutually exclusive; you can define both and use them together while minimizing the function, as sketched below.
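
Such a combined call could look like this (a sketch reusing the variable names from the snippets above):

opt = pysmac.SMAC_optimizer()
value, config = opt.minimize(function_to_minimize, num_function_calls,
                             parameter_definition,
                             conditional_clauses = conditionals,
                             forbidden_clauses = forbidden_confs)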

Introduced in SMAC 2.10, there is an advanced syntax that allows more complex situations to be handled. It allows parameter values to be compared to each other, and a limited set of functions to be applied:

  • Arithmetic operations: +, -, *, /, ^, unary +/-, and %
  • Functions: abs, (a)cos, (a)sin, (a)tan, exp, sinh, cosh, ceil, floor, log, log2, log10, sqrt, cbrt
  • Logical operators: >=, <=, >, <, ==, !=, ||, &&

Some examples (without the appropriate parameter definitions) to illustrate this notation:

forbidden_confs = ["{activity == 'swimming' && water_temperature < 15}",
                   "{x^2 + y^2  < 1}",
                   "{sqrt(a) + log10(b) >= - exp(2)}"]

Note

The SMAC manual has an extensive section on important tips and caveats related to forbidden parameters. Here are some major points:

  • SMAC generates random configurations without honoring the forbidden clauses, but rejects those that violate at least one. So constraining the space too much will slow down the process.

  • Even meaningless forbidden clauses (those that are always false) still take time to evaluate, slowing SMAC down.

  • Applying arithmetic operations/functions to the non-numerical values of categorical/ordinal parameters leads to undefined behavior! For categorical types, only == and != should be used. You may use >=, <=, <, and > for ordinal parameters.

  • Don’t use == and != with real types, as these comparisons are almost always false (or true, respectively).

  • Names and values of parameters must not contain anything besides alphanumeric characters and underscores.

  • When defining ordinal parameters, you have to keep the values consistent. E.g.

    parameter_definition = dict(\
        a = ("ordinal", ["warm", "medium", "cold"], "medium"),
        b = ("ordinal", ["cold", "medium", "warm"], "medium"),
        c = ("ordinal", [100, 10, 1], 100))
    

    The problem here is that “warm < cold” for a, but “warm > cold” for b. For numerical values, the definition of c implies “100 < 10”, which is not true. A consistent version is shown below.
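
    A consistent version simply orders all value lists the same way, e.g., ascending:

    parameter_definition = dict(\
        a = ("ordinal", ["cold", "medium", "warm"], "medium"),
        b = ("ordinal", ["cold", "medium", "warm"], "medium"),
        c = ("ordinal", [1, 10, 100], 100))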

For more details, please refer to the SMAC manual.

Combining Parameter Configuration Spaces
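
Sometimes several functions, each with its own parameter definitions, conditional clauses, and forbidden clauses, should be optimized jointly, so that SMAC effectively picks the best function along with its best parameter settings. Instead of merging everything by hand (as done in the conditionals example above), pySMAC offers the convenience function merge_configuration_spaces, which combines the individual configuration spaces into one and generates the necessary wrapper functions. The following script reuses the classifiers from the previous example: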

import sys
sys.path.append("../../pynisher")
sys.path.append('..')

import pysmac

import sklearn.ensemble
import sklearn.neighbors
import sklearn.datasets
import sklearn.cross_validation

# We use a random classification data set generated by sklearn
# As commonly done, we use a train-test split to avoid overfitting.
X,Y = sklearn.datasets.make_classification(1000, 20)
X_train, X_test, Y_train, Y_test = \
	sklearn.cross_validation.train_test_split(X,Y, test_size=0.33, random_state=1)


# training a random forest
def random_forest (n_trees=None, criterion=None, max_features=None, max_depth=None):
	predictor = sklearn.ensemble.RandomForestClassifier(n_trees, criterion, max_features, max_depth)
	predictor.fit(X_train, Y_train)
	return (-predictor.score(X_test, Y_test))

# and defining some of its parameters
parameters_trees = dict(\
	max_depth =  ("integer", [1,10],  4),
	max_features=("integer", [1,20], 10),
	n_trees=("integer", [1,100],10 ,'log'),          
	criterion =("categorical", ['gini', 'entropy'], 'entropy'),
	)

# training a k-nearest neighbor classifier
def knn (n_neighbors=None, weights=None):
	predictor = sklearn.neighbors.KNeighborsClassifier(n_neighbors, weights)
	predictor.fit(X_train, Y_train)
	return (-predictor.score(X_test, Y_test))

# and defining some of its parameters
parameters_knn = dict(\
	n_neighbors = ("integer", [1,100], 10, 'log'),
	weights = ("categorical", ['uniform', 'distance'], 'uniform'),
	)



# convenience function
from pysmac.utils.pcs_merge import merge_configuration_spaces

# returns a parameter config space, the conditionals, the forbidden clauses, and two wrapper functions (as a string)
p,c,f,wrapper_str = merge_configuration_spaces(\
		(random_forest, parameters_trees, [], []),
		(knn, parameters_knn, [],[]))

# workaround to make the generated functions picklable (needed for pySMAC internals):
# they are generated as a string and instantiated by executing that string
exec(wrapper_str)


# create optimizer object
opt = pysmac.SMAC_optimizer( debug = 0,
							 working_directory = '/tmp/pySMAC_test/',
							 persistent_files=True, )

# perform actual optimization
value, parameters = opt.minimize(pysmac_merged_pcs_wrapper, 
					#wrapper function generated by merge_configuration_spaces
                    50, # number of function evaluations
                    p, # parameter definition
                    conditional_clauses = c,
                    forbidden_clauses = f)

print('The highest accuracy found: %f'%(-value))
print('Parameter setting %s'%parameters)


# For very complex configuration spaces, interpreting the final values can be hard.
# You can use a function defined in 'wrapper_str' that finds the right callable and
# the corresponding arguments (with their original names). That way, the result is
# more human-readable.

func, kwargs = pysmac_merged_pcs_reduce_args(**parameters)
print("The best result was found calling '{}' with the arguments {}".format(func.__name__, kwargs))