AutoML and Neural Architecture Search
Introduction
AutoML aims to automate the difficult part of machine learning, including hyperparameter optimization, automated feature engineering, automated model selection, and automated pipeline construction. Neural architecture search (NAS) is a subfield of AutoML that uses reinforcement learning, evolutionary method, or neural network method itself to search for architecture design since designing neural networks is still considered hard. A recurrent network can be used to generate model descriptions of neural network (since the structure and connectivity of a neural network can be specified by a string). The RNN is then trained in RL to maximize the expected accuracy of the generated models on a validation set. On CIFAR-10, the method can design from scratch a network architecture that is competitive. It also reaches state of the art result on Penn Treebank dataset.
There are different methods to automate machine learning. Hyperparameter optimization, for example, is limited in that it only searches a fixed length space. It doesn’t generate a variable length configuration that works with the structure and connectivity of a network. There is also Bayesian optimization methods that can do the search on non fixed length architectures, but they are still limited. Modern neuro-evolution methods are more flexible in creating new models, but are less practical at scale. Similar ideas to neural architecture search, there are program synthesis, inductive programming, BLEU optimization, learning to learn (meta learning).
Neural architecture search
For neural architecture search, a controller is used to generate architectural hyperparameters of convolutional neural network. The parameters include number of filters, filter height, filter width, stride height, stride width. There is a maximum number of layers to be generated. After the description is generated, the model is built and trained. Then the accuracy is evaluated for a held-out validation set. Then the RNN is trained under REINFORCE, a policy gradient method to update its parameters for better generation of architecture over time. The accuracy R on test set is the reward in the RL framework. And the target of the RL process is to maximize this expected accuracy \(J(\theta) = E{[R]}\). Since the reward is not differential, a policy gradient method is used to update parameters \(\theta\).
\(\nabla J(\theta) = \sum_{t=1}^{T} E{[\nabla log P(a_t \mid a_{t-1}; \theta) R ]}\) An empirical approximation: \(\frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla log P(a_t \mid a_{t-1}; \theta) R_k\) with m being the number of different architectures in one batch and T being the number of hyperparameters used to design an architecture. \(R_k\) is the validation accuracy that the kth architecture get after being built and trained. For a version that has lower variance \(\frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla log P(a_t \mid a_{t-1}; \theta) (R_k - b)\) with b being the baseline (exponential moving average of the previous architecture accuracies). The training is a distributed system. There is a parameter server that store the shared parameters. Each controller samples m different child architectures and train in parallel. Then the gradients are collected and sent back to the server. Then the ability to add skip connection and branching layers (as in GoogleNet and Residual Net) is added.
For a text task, the authors design a process to design RNN cells. They predict the operations (add, element multiplication, inner product) and the activation function (tanh, relu, sigmoid).
Specifically, for the image classification task, CIFAR-10 is used. Images are whitened, upsampled then randomly cropped at 32x32. Finally, images are randomly flipped horizontally. The search space for the task is about convolutional architectures, with RELU for non linearity, batch normalization, and skip connections. For each convolutional layer, the filter height is [1,3,5,7], the width is [1,3,5,7], the number of filters is [24,36,48,64]. The strides are in [1,2,3]. The architecture generator is called the controller. It is a two layer LSTM with 35 hidden units on each layer. It is trained with Adam optimizer, learning rate of 0.0006. The weights are initialized uniformly between -0.08 and 0.08. Once a model is sampled, it is built and trained for 50 epochs. The training set has 45000 images and the test set has 5000. Momentum optimizer is used, learning rate of 0.1, weight decay of 1e-4, momentum of 0.9. The accuracy of the last 5 epochs cubed is fed back to the controller as reward. The controllers gradually generate more complex (more layers) model over time.
So the controller is able to design a 15 layer network that achieves 5.5% error rate on the test set. It is the shallowest among the human designed top performers. It has many skip connections and has an interesting system of filters: the filters are rectangular and the one at top layers are bigger. There are other architectures that perform competitively with the human designed top performers, but with many more layers (39) and filters (40).
For the language modelling task, NAS is applied to the Penn Treebank dataset.
The result is that NAS designed models outperform state of the art models. Not only they are better, they are also faster. The new cell architecture is similar to LSTM in the beginning, since it calculates similarly and then send the result to different components in the cell. The newly found cell is named NASCell and is in Tensorflow.
Further control experiments are run that add more functions to expand search space and make comparison to random search.
ENAS
NAS is evolved into ENAS (efficient neural architecture search). They achieves strong performances, using much less GPU than NAS (1000x less). The central idea is to view all the graphs that NAS iterates over to be sub-graphs of a larger graph.
The controller in ENAS is an LSTM with 100 hidden unit and is autoregressive: the decision in the previous step is fed as input for the next step, at the first step, the input is empty. In ENAS, apart from the parameters \(\theta\) of the controller, there are the shared parameters w of the child models. The first phase trains for w. For the Penn Treebank task, w is trained for 400 steps, each with a minibatch of 64, where the gradient \(\nabla_w\) is calculated with backprop through time. For CIFAR-10, w is trained on 45000 images, using minibatch of 128, \(\nabla_w\) is backproped as usual. The second phase trains for \(\theta\), for about 2000 steps.
Specifically, the controller’s policy is \(\pi(m;\theta)\), SGD is performed on w to minimize the expected loss \(E{[L(m,w)]}\). L here is the cross entropy loss, on minibatches of the data, with m being sampled from \(\pi(m;\theta)\).
\(\nabla_w E{[L(m;w)]} \approx \frac{1}{M} \sum_{i=1}^M \nabla_w L(m, w)\).
Then the w is fixed and we update \(\theta\), to maximize the expected reward \(E{[R(m,w)]}\), with Adam optimizer, the gradient is under REINFORCE, with a moving average baseline to reduce variance. The reward R(m,w) is calculated on the validation set (on minibatches).
For the convolutional task, the controller RNN samples two sets of decision at each decision: what previous nodes to connect to (this forms skip connection) and what computation operation to use (there are convolutions with filters, max pooling and average pooling). These decisions build a layer in the network.
The ENAS cell, apart from being competitive with current benchmarks, also has some interesting properties. Firstly, all the nonlinearity of the cell is due to either ReLU or tanh, even though the author adds identity and sigmoid as well. Secondly, it is a local optimum since the performance drops when it is pertubed. Thirdly, its architecture is very similar to a benchmark (the MoC), but with some added balance that increases the expressiveness and depth of the cell.
For the task on CIFAR-10, the standard preprocessing and augmentation techniques are applied: substracting the channel mean and dividing by the channel standard deviation, padding the image to 40x40 then randomly cropping it back to 32x32, randomly flipping them horizontally. The shared parameters w are trained with Nesterov momentum, with learning rate scheme to be the cosine schedule, initialized with He initialization, weight decay of \(10^{-4}\). For the architecture, the policy parameters \(\theta\) are initialized uniformly in [-0.1,0.1], trained with Adam optimizaer at learning rate of 0.00035.
For the result, DenseNet architecture designed by human expert are holding the highest performance (test error rate of 2.56%). ENAS, when trained to design an entire convolutional network, finds an architecture that achieves 3.87% test error, in 7 hours. Empirically, ENAS is far better than guided random search.
Conclusion
In conclusion, automated machine learning (AutoML) and neural architecture search (NAS) represent significant advancements in the field of artificial intelligence. AutoML, through automating the selection of models, hyperparameter tuning, and feature selection, effectively removes a lot of the complexity and difficulty involved in creating machine learning models. This enables a broader range of people, even those without specialized machine learning training, to implement machine learning solutions. As a subfield, NAS provides an efficient way to discover the best neural network architecture for a given dataset and task.
However, it is to be noted that these approaches do not eliminate the need for domain knowledge and understanding of machine learning principles, which are still crucial for defining the problem, preprocessing the data, and evaluating the results. Going forward, the ongoing development of more efficient search and optimization algorithms, as well as methods for improving the transparency and interpretability of AutoML and NAS, will be important areas of research.