# Hyperparameter Optimization with the Keras Tuner, Part 2

In Part 1 of this series, I introduced the Keras Tuner and applied it to a 4-layer DNN. I demonstrated how to tune the number of hidden units in a Dense layer and how to choose the best activation function. Recall that the tuner I chose was the RandomSearch tuner, which tries random combinations of the hyperparameters and keeps the best outcome. Because there are so many possible combinations, this is a very limited optimization approach unless you have infinite time and GPUs available. So let’s try another approach: a second built-in Keras tuner, the BayesianOptimization tuner.

# Bayesian Optimization

Bayesian optimization is a method for optimizing an objective function whose form is completely unknown. In our case the objective is the function that maps a set of hyperparameters to our DNN model’s validation loss. In the Keras Tuner, a Gaussian process is used to “fit” this objective function, starting from a “prior” and updating it as trials complete. A second function, called an acquisition function, then uses that fit to choose which hyperparameters to try next, which generates new data about our objective function. The acquisition function used is the upper confidence bound (UCB). If you want to see the mathematical explanation, please refer to the academic papers on the subject. There is also a good overview of Bayesian Learning by Nadheesh Jihan.
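To make the idea concrete, here is a minimal NumPy sketch of one step of Bayesian optimization on a single made-up hyperparameter. It is a toy under stated assumptions, not the Keras Tuner's internal code: the RBF kernel, its `length_scale`, the `noise` term, the `beta` weight, and the observed losses are all illustrative choices of mine. Because we want to minimize a loss, the sketch picks the candidate with the lowest value of `mean - beta * std`, which is the minimization counterpart of the upper confidence bound.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a GP at the points x_new."""
    y_mean = y_obs.mean()  # center targets so the prior mean is the average loss
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_obs, k_cross)      # K^-1 k*, shape (n_obs, n_new)
    mean = y_mean + solved.T @ (y_obs - y_mean)
    var = 1.0 - np.sum(k_cross * solved, axis=0)  # RBF prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def next_candidate(x_obs, y_obs, candidates, beta=2.0):
    """Pick the candidate with the lowest optimistic estimate of the loss."""
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    acquisition = mean - beta * std  # confidence bound for a loss we minimize
    return candidates[np.argmin(acquisition)]


# Made-up validation losses at three values of one scaled hyperparameter.
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.array([2.10, 1.95, 2.30])
candidates = np.linspace(0.0, 1.0, 101)

x_next = next_candidate(x_obs, y_obs, candidates)
print(f"next hyperparameter value to try: {x_next:.2f}")
```

A larger `beta` favors candidates where the model is uncertain (exploration), while a smaller one favors candidates close to the best loss seen so far (exploitation).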

# Application to CIFAR10

## Baseline

In theory the BayesianOptimization tuner should produce a better set of hyperparameters than the RandomSearch tuner. First I applied it to the exact same model as before with the exact same hyperparameter possibilities. You can follow along in the accompanying notebook on GitHub if you prefer. Here’s the hypermodel:

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))
model.add(tf.keras.layers.Dense(
    units=hp.Int('u_1', min_value=16, max_value=256, step=16),
    activation=hp.Choice(name='a_1',
                         values=['relu', 'tanh', 'elu', 'selu', 'swish'])))
model.add(tf.keras.layers.Dense(
    units=hp.Int('u_2', min_value=16, max_value=256, step=16),
    activation=hp.Choice(name='a_2',
                         values=['relu', 'tanh', 'elu', 'selu', 'swish'])))
model.add(tf.keras.layers.Dense(
    units=hp.Int('u_3', min_value=16, max_value=256, step=16),
    activation=hp.Choice(name='a_3',
                         values=['relu', 'tanh', 'elu', 'selu', 'swish'])))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
```
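To get a feel for how large this search space is, the number of distinct configurations in the hypermodel above can be counted directly. This is plain arithmetic on the ranges shown, relying on `max_value` being inclusive in `hp.Int`; the variable names are my own.

```python
# Values each hyperparameter can take in the baseline hypermodel.
unit_choices = list(range(16, 256 + 1, 16))  # hp.Int(min_value=16, max_value=256, step=16)
activation_choices = ['relu', 'tanh', 'elu', 'selu', 'swish']
num_tuned_layers = 3  # u_1/a_1, u_2/a_2, u_3/a_3

per_layer = len(unit_choices) * len(activation_choices)
total_combinations = per_layer ** num_tuned_layers

print(f"{len(unit_choices)} unit sizes x {len(activation_choices)} activations "
      f"= {per_layer} options per layer")
print(f"{total_combinations:,} combinations across {num_tuned_layers} layers")
```

With 512,000 possible combinations, any realistic trial budget only ever samples a tiny fraction of the space, which is why the strategy for choosing the next trial matters.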

Then we configure our BayesianOptimization tuner and perform the search of the hyperparameter space.

```python
tuner = kt.BayesianOptimization(hypermodel=build_hypermodel,
                                objective='val_loss',
                                max_trials=25,
                                num_initial_points=2,
                                directory='test_dir',
                                project_name='a',
                                seed=seed)

tuner.search(train_images, train_labels,
             epochs=200,
             batch_size=BATCH_SIZE,
             validation_data=(test_images, test_labels))
```

After completing this, our new baseline model has the following architecture:

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten (Flatten)            (None, 3072)              0
_________________________________________________________________
dense (Dense)                (None, 256)               786688
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792
_________________________________________________________________
dense_2 (Dense)              (None, 16)                4112
_________________________________________________________________
dense_3 (Dense)              (None, 10)                170
=================================================================
Total params: 856,762
Trainable params: 856,762
Non-trainable params: 0
```

Layers dense, dense_1, and dense_2 have activation functions (via tuning) of relu, swish, and relu respectively. This looks like a better model than our RandomSearch optimization produced, but when I trained it, I achieved almost identical metrics, with a loss of 1.98 and an accuracy of 0.480.
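As a sanity check on the summary above, the parameter counts can be reproduced by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_param_count` is my own.

```python
def dense_param_count(layer_sizes):
    """Parameters per Dense layer for a stack of fully connected layers.

    layer_sizes lists the input width followed by each layer's units.
    """
    return [n_in * n_out + n_out  # weights plus one bias per unit
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# Flattened 32x32x3 image, then the tuned baseline layers and the classifier.
baseline = [32 * 32 * 3, 256, 256, 16, 10]
per_layer = dense_param_count(baseline)

print(per_layer)       # [786688, 65792, 4112, 170]
print(sum(per_layer))  # 856762
```

Almost all of the parameters sit in the first Dense layer, because it is connected to all 3,072 input values.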

## Depth and Width

Next I got more ambitious and allowed the tuner to optimize both the number of Dense layers in the model and the number of hidden units in each of those layers. With 2 to 4 layers and 8 possible widths per layer, this gives 8² + 8³ + 8⁴ = 4,672 distinct architectures. Here is the hypermodel for tuning:

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))
for i in range(hp.Int('num_layers', 2, 4)):
    model.add(tf.keras.layers.Dense(units=hp.Int('units_' + str(i), 64, 512, 64),
                                    activation='elu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
```

A couple of things to note about the hypermodel. First, in the for loop itself we are using an hp.Int named ‘num_layers’ to optimize how many Dense layers there will be in our model. Second, inside the loop each layer gets its own hp.Int hyperparameter with a distinct name (‘units_0’, ‘units_1’, and so on), so each of these Dense layers can have a different number of hidden units. After tuning with BayesianOptimization, the selected model looks like this:

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten (Flatten)            (None, 3072)              0
_________________________________________________________________
dense (Dense)                (None, 64)                196672
_________________________________________________________________
dense_1 (Dense)              (None, 256)               16640
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2570
=================================================================
Total params: 215,882
Trainable params: 215,882
Non-trainable params: 0
```

Interestingly, it selected a shallow model using 2 Dense layers (plus the non-optional final classifier, dense_2). When I fully trained this model and ran it on our test dataset, I achieved a loss of 1.98 and an accuracy of 0.481. Again, this is very close to the prior result.
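The way the loop in this section's hypermodel registers one width hyperparameter per layer can be illustrated without running a tuner at all. The `RecordingHP` class below is a hypothetical stand-in that I wrote for this illustration; it is not part of the Keras Tuner, and it only mimics the `Int` call used in the hypermodel.

```python
class RecordingHP:
    """Hypothetical stand-in for the tuner's `hp` object (not a Keras Tuner class).

    It returns a fixed value for every hyperparameter and records the names
    that the model-building code asks for.
    """

    def __init__(self, fixed_values):
        self.fixed_values = fixed_values
        self.requested = []

    def Int(self, name, min_value, max_value, step=1):
        self.requested.append(name)
        value = self.fixed_values.get(name, min_value)  # fall back to the minimum
        assert min_value <= value <= max_value
        return value


def layer_widths(hp):
    """Mirror the loop in the hypermodel, returning only the layer widths."""
    return [hp.Int('units_' + str(i), 64, 512, 64)
            for i in range(hp.Int('num_layers', 2, 4))]


hp = RecordingHP({'num_layers': 3, 'units_0': 64, 'units_1': 256, 'units_2': 128})
print(layer_widths(hp))  # [64, 256, 128]
print(hp.requested)      # ['num_layers', 'units_0', 'units_1', 'units_2']
```

Only the width hyperparameters for layers that actually exist get requested, so a trial with ‘num_layers’ set to 2 never touches ‘units_2’ or ‘units_3’.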

## Depth only

As an experiment I decided to only tune the depth of the model with the following configured hypermodel:

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))
for i in range(hp.Int('num_layers', 2, 20)):
    model.add(tf.keras.layers.Dense(128, activation='elu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
```

In it I am only tuning the ‘num_layers’ parameter. All of the optional Dense layers have the same number of hidden units, 128. Here is the tuned model produced by BayesianOptimization:

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten (Flatten)            (None, 3072)              0
_________________________________________________________________
dense (Dense)                (None, 128)               393344
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290
=================================================================
Total params: 411,146
Trainable params: 411,146
Non-trainable params: 0
```

Again we have a resulting model with only two optional Dense layers, dense and dense_1 above. The result after training and evaluation on the test set was 1.96 loss and 0.501 accuracy.

# Final Results

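Using the BayesianOptimization tuner did not produce significantly better results than RandomSearch. Tuning the depth (number of layers) of the DNN model also produced very limited improvement. The test results for the three experiments in this post are:

| Experiment | Test loss | Test accuracy |
| --- | --- | --- |
| Baseline (units and activations tuned) | 1.98 | 0.480 |
| Depth and width | 1.98 | 0.481 |
| Depth only | 1.96 | 0.501 |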

I believe I have reached the limit (about 50% accuracy) of what my simple DNN can achieve with this dataset, which is why I am not seeing much improvement in the accuracy. But I hope you found the code instructive. In Part 3 of this series I will explore more hyperparameter tuning of the model’s training setup via the Keras Tuner. I will also use a different model architecture which should be more accurate than a DNN.