This post contains my study notes for Andrew Ng’s course. https://www.andrewng.org/courses/
Here we use a three-layer network: an input layer, a hidden layer, and an output layer.
Let $m$ be the number of inputs (images in our case), and $K$ be the number of possible labels. The cost function for the neural network (without regularization) is

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) - (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right]$$
To avoid over-fitting, we use the cost function for neural networks with regularization:

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) - (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l} \sum_{j} \sum_{k} \big(\Theta_{j,k}^{(l)}\big)^2,$$

where the regularization sum runs over all weights except those multiplying the bias units.
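As a quick illustration, here is a minimal NumPy sketch of this regularized cost for the three-layer network (the function and variable names are my own, and I assume `Theta1` and `Theta2` store the bias weights in their first column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized cost for a 3-layer network.
    X: (m, n) inputs; Y: (m, K) one-hot labels;
    Theta1, Theta2: weight matrices with the bias weights in column 0."""
    m = X.shape[0]
    # Forward pass: prepend the bias unit at each layer.
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ Theta1.T)])
    h = sigmoid(a2 @ Theta2.T)                    # (m, K) hypothesis
    # Unregularized cross-entropy cost, summed over examples and labels.
    J = np.sum(-Y * np.log(h) - (1 - Y) * np.log(1 - h)) / m
    # Regularization: sum of squared weights, skipping the bias column.
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    return J
```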
When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{(l)}$ uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$. We use $\epsilon_{init} = 0.12$. This range of values ensures that the parameters are kept small and makes the learning more efficient.
One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice is $\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$, where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the numbers of units in the layers adjacent to $\Theta^{(l)}$.
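A one-function sketch of this initialization (names are mine; the returned matrix has shape $(L_{out},\, 1 + L_{in})$, the extra column holding the bias weights):

```python
import numpy as np

def rand_init(L_in, L_out, eps_init=None):
    """Uniform random weights in [-eps_init, eps_init] for a layer with
    L_in inputs and L_out outputs; the +1 column is for the bias weights."""
    if eps_init is None:
        eps_init = np.sqrt(6) / np.sqrt(L_in + L_out)  # default choice above
    return np.random.uniform(-eps_init, eps_init, size=(L_out, 1 + L_in))
```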
The gradient of the neural network cost function is obtained by the backpropagation algorithm. The intuition behind backpropagation is as follows. Given a training example $(x^{(t)}, y^{(t)})$, we first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_\Theta(x)$. Then, for each node $j$ in layer $l$, we would like to compute an error term $\delta_j^{(l)}$ that measures how much that node was responsible for any errors in our output.
For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). For the hidden units, we compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l+1)$.
Procedure
In detail, here is the backpropagation algorithm. We implement steps 1 to 4 in a loop that processes one example at a time. Concretely, we implement a for-loop for $t = 1, \dots, m$ and place steps 1-4 below inside the for-loop, with the $t$-th iteration performing the calculation on the $t$-th training example $(x^{(t)}, y^{(t)})$. Step 5 then averages the accumulated gradients; a code sketch of the whole loop follows the list.
- Set the input layer’s values $a^{(1)}$ to the $t$-th training example $x^{(t)}$. Perform a feedforward pass, computing the activations $z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)}$ for layers 2 and 3. Note that we need to add a $+1$ term to ensure that the vectors of activations $a^{(1)}$ and $a^{(2)}$ also include the bias unit.
- For each output unit $k$ in layer 3 (the output layer), set $\delta_k^{(3)} = a_k^{(3)} - y_k$, where $y_k \in \{0, 1\}$ indicates whether the current training example belongs to class $k$.
- For the hidden layer $l = 2$, set $\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \odot g'(z^{(2)})$, where $\odot$ denotes element-wise multiplication and $g'$ is the sigmoid gradient.
- Accumulate the gradient from this example using $\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$. Note that we should skip or remove $\delta_0^{(2)}$, the error term for the bias unit.
- Obtain the (unregularized) gradient for the neural network cost function by dividing the accumulated gradients by $m$: $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$.
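Putting steps 1-5 together, here is a sketch of the loop in NumPy (my own naming; step 3 skips $\delta_0^{(2)}$ by dropping the bias column of $\Theta^{(2)}$, which is equivalent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(Theta1, Theta2, X, Y):
    """Unregularized gradients D1, D2 via steps 1-5 above."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):
        # Step 1: feedforward pass, prepending the bias unit (+1 term).
        a1 = np.concatenate(([1.0], X[t]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # Step 2: output-layer error term.
        d3 = a3 - Y[t]
        # Step 3: hidden-layer error term (bias column of Theta2 dropped,
        # which skips delta_0).
        d2 = (Theta2[:, 1:].T @ d3) * sigmoid_gradient(z2)
        # Step 4: accumulate the gradients.
        Delta1 += np.outer(d2, a1)
        Delta2 += np.outer(d3, a2)
    # Step 5: divide the accumulated gradients by m.
    return Delta1 / m, Delta2 / m
```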
To account for regularization, it turns out that we can add it as an additional term after computing the gradients using backpropagation. Specifically, after we have computed $\Delta_{ij}^{(l)}$ using backpropagation, we add regularization using

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} \quad \text{for } j = 0$$

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} \quad \text{for } j \ge 1$$

Note that the bias weights ($j = 0$) are not regularized.
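In code, this is one extra step after `backprop` (again a sketch with my own names; only the non-bias columns, $j \ge 1$, are touched):

```python
def regularized_gradients(D1, D2, Theta1, Theta2, lam, m):
    """Add (lambda/m) * Theta to the unregularized gradients,
    leaving the bias column (j = 0) unregularized."""
    D1 = D1.copy()
    D2 = D2.copy()
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```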
After we have successfully implemented the neural network cost function and gradient computation by feedforward propagation and backpropagation, the next step is to learn a good set of parameters by minimizing the cost function. Since the cost function is non-convex, there is no guarantee that we will find the global minimum. We should therefore try increasing the number of iterations in our minimizer (say, gradient descent or conjugate gradient), and run the solver several times: since the initialization is random, different initializations may converge to different local optima.
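One way to act on this, sketched below with SciPy’s conjugate-gradient minimizer: restart the optimization from several random initializations and keep the best run. `cost_and_grad` and `init_fn` are assumed helpers (a function returning `(J, grad)` for a flattened parameter vector, and a fresh random flattened initialization, respectively), not part of the course code:

```python
from scipy.optimize import minimize

def train(cost_and_grad, init_fn, n_restarts=5, maxiter=400):
    """Minimize a non-convex cost from several random starting points
    and keep the result with the lowest final cost."""
    best = None
    for _ in range(n_restarts):
        res = minimize(cost_and_grad, init_fn(), jac=True,
                       method="CG", options={"maxiter": maxiter})
        if best is None or res.fun < best.fun:
            best = res
    return best.x  # flattened parameters of the best run
```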