Skewness and Tailedness

Skewness describes how symmetric distribution is. It is defined as the third moment of the distribution after normalization:

    \[\mbox{Skew}[X]:= \mathbb{E}[(\frac{X-\mu}{\sigma})^3]\]

Kurtosis is a measure of tailedness. The heavier the tail is, the larger its Kurtosis is. Mathematically, it is defined as

    \[\mbox{Kurt}[X]:= \mathbb{E}[(\frac{X-\mu}{\sigma})^4]=\frac{\mu_4}{\sigma^4}=\frac{\mu_4}{\mu_2^2},\]

where \mu is the mean, and \sigma is the standard deviation of the distribution of X; and \mu_k is k-th central moment.

It can be shown that the Kurtosis of Gaussian distribution is \mbox{Kurt}[N(0,1)] = 3. People usually use excess Kurtosis as the extra Kurtosis of a distribution compared with standard Gaussian distribution. Namely,

    \[\mbox{Excess Kurt}[X] := \mbox{Kurt}[X] - 3.\]

We can create a plot with the square of Skewness as its x-axis and Kurtosis as its y-axis. This plot is called Cullen and Frey graph.

Image result for cullen and Frey graph

This graph helps us to determine which distribution our data is closest to.

Gaussian Mixture Model

Gaussian mixture model (GMM) is a probability model for a mixture of several Gaussian distributions with possibly different mean and variance.

For example, we can model the 100m race time of all grade 12 students in a high school as two normal distributions: one for female students and one for male students. It is reasonable to expect two groups have different mean and may different variance.

When to use Gaussian mixture model?

1. Data has more than one clusters.
In the following picture, the left one models the data with one normal distribution; the right one models the data by two normal distribution, Gaussian mixture model. Obviously, the right one better describes the data.

Pictures are from https://brilliant.org/wiki/gaussian-mixture-model/

2. Each cluster is theoretically normally distributed.

Theory of Gaussian Mixture Model

1. Gaussian distribution in 1 dimension
Since there are several Gaussian distributions in the GMM. We assign an index to each Gaussian distribution: k for k=1,2,..., K where K is the number of clusters. For a given mean \mu_k and variance \sigma_k, the probability density function is

    \[p_k(x|\mu_k,\sigma_k) = \frac{1}{\sigma_k\sqrt{2\pi}}exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)\]


Above | is not the mathematical conditional expectation, but a statistical way of saying we know true parameters \mu_k,\sigma_k in advance.

2. Gaussian mixture model in 1 dimension
The probability density function of GMM is the weighted average of several Gaussian densities:

    \[p(x|\mu_k,\sigma_k, k=1,2,...,K) = \sum_{k=1}^K w_k \cdot p_k(x|\mu_k,\sigma_k ),\]


where w_k \in [0,1] satisfies

    \[\sum_{k=1}^K w_k  = 1\]


Plug in the Gaussian density,

    \[p(x) =  \sum_{k=1}^K w_k   \frac{1}{\sigma_k\sqrt{2\pi}}exp\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)\]

Note that this is a density function because its integral on \mathbb{R} is 1.

3. Gaussian mixture model in n-dimension
Let x \in \mathbb{R}^n be an n-dimension multivariate Gaussian random variable with mean vector
\mathbf{\mu_k} and covariance matrix \Sigma_k. Then the probability density function is

    \[p_k(x|\mathbf{\mu_k} ,  \Sigma_k ) = \frac{1}{\sqrt{(2\pi)^K|\Sigma_k|}}exp\left(-\frac{1}{2}(x-\mathbf{\mu}_k)^T\Sigma_k^{-1}(x-\mathbf{\mu}_k )\right)\]


Then, the probability density function of GMM, which is the weighted average of serveral multivariate Gaussian density, is

    \[p(x) =   \sum_{k=1}^K w_k  \frac{1}{\sqrt{(2\pi)^K|\Sigma_k|}}exp\left(-\frac{1}{2}(x-\mathbf{\mu}_k)^T\Sigma_k^{-1}(x-\mathbf{\mu}_k )\right)\]


with

    \[\sum_{k=1}^K w_k  = 1\]

Training the Model

Suppose that we know the number of clusters K a priori. (The choice of K relies on statistician’s experience.) Then, we can use Expectation Maximization (EM) algorithm to find the parameters \hat{w_k}, \hat{\mu_k} and \hat{\sigma_k} or \hat{\Sigma_k} for multi-dimensional model. Let K be the number of clusters, and N be the number of samples.

Step 1: Initialize

  • Randomly choose K samples and set them to be the group mean. For example, in the case of K=2, N=300, \hat{\mu_1} = x_{57}, \hat{\mu_2} = x_{193}. (note that this is also valid for multi-dimensional case)
  • Set all variances (resp. covariance matrices) to be the same value: sample variance (resp. sample covariance matrix). Namely,

        \[\hat{\sigma_k}^2 = \frac{1}{N}\sum_i^N(x_i-\bar{x})^2,\]

    where \bar{x}=\sum_1^Nx_i/N.
  • Set all weights equal to 1/K, i.e.,

        \[\hat{w_k} = \frac{1}{K}.\]

Step 2: Expectation

We compute the probability that a sample x_n belongs to cluster C_k.

    \[\begin{aligned} \gamma_{nk}  = &~ \mathbf{P}(x_n \in C_k|x_n, \hat{w_k}, \hat{\mu_k},\hat{\sigma_k}  ) \\    = &~ \frac{\hat{w_k}p_k(x_n|\hat{\mu_k},\hat{\sigma_k})}{\sum_{j=1}^K \hat{w_j}p_j(x_n|\hat{\mu_j},\hat{\sigma_j}) } \end{aligned} \]

Step 3: Maximization

Update parameters then go back to step 2 until converge

  •     \[\hat{w_k} = \frac{\sum_{n=1}^N\gamma_{nk}}{N}\]

  •     \[\hat{\mu_k} =  \frac{\sum_{n=1}^N\gamma_{nk}x_i}{ \sum_{n=1}^N\gamma_{nk} }\]

  •     \[\hat{\sigma_k} =  \frac{\sum_{n=1}^N\gamma_{nk}(x_n-\hat{\mu_k} )^2}{ \sum_{n=1}^N\gamma_{nk} }\]

    resp.

        \[\hat{\Sigma_k} =  \frac{\sum_{n=1}^N\gamma_{nk}(x_n-\hat{\mu_k} )(x_n-\hat{\mu_k} )^T }{ \sum_{n=1}^N\gamma_{nk} }\]

Neural Network

This post is my study notes of Andrew Ng’s course. https://www.andrewng.org/courses/

Neural network is a convolution of several logistic regressions. It allows some dependence between those regressions. Neural network incorporates more coefficients that will be learned from the date, so it should provide higher accuracy than a single logistic regression. The only thing we need to pay attention is over-fitting.

Here we use neural network with 3 layers (an input layer, a hidden layer, and an output layer) as an example for background information. The case of more layers is quite similar. In this article, our inputs are 25 by 25 pixels images. Since the images are of size 25 \times 25, this gives us 625 input layer units (not counting the extra bias unit). The training data will be loaded into the variables X and y, where X is the image, and y is the label.

Let m be the number of inputs(images in our case), and K be the number of possible lables. The cost function for the neural network (without regularization) is

    \[\begin{aligned}J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^K &[-y_k^{(i)}\log(h_{\theta}(x^{(i)})k) \\ &- (1-y_k^{(i)}) \log(1-(h{\theta}(x^{(i)})_k)) ]\end{aligned}\]

To avoid over-fitting, we use the cost function for neural networks with regularization

    \[\begin{aligned}J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^K [-y_k^{(i)}\log(h_{\theta}(x^{(i)})k) \\ - (1-y_k^{(i)}) \log(1-(h{\theta}(x^{(i)})k)) ] \\+\frac{\lambda}{2m} [ \sum_{j=1}^{25} \sum_{k=1}^{625} (\Theta_{j,k}^{(1)})^2 + \sum_{j=1}^{2}\sum_{k=1}^{25} (\Theta_{j,k}^{(2)})^2 ]\end{aligned}\]

When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for \Theta^{(l)} uniformly in the range [-\epsilon,\epsilon]. We use \epsilon=0.12. This range of values ensures that the parameters are kept small and makes the learning more efficient.

One effective strategy for choosing \epsilon is to base it on the number of units in the network. A good choice of \epsilon is \epsilon = \frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}, where L_{in} = sl and L_{out} = sl+1 are the number of units in the layers adjacent to \Theta^{(l)}.

The error of the neural network is obtained by the backpropagation algorithm. The intuition behind the backpropagation algorithm is as follows. Given a training example (x(t), y(t)), we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis h_{\Theta}(x). Then, for each node j in layer l, we would like to compute
an error term \delta_j^{(l)} that measures how much that node was responsible for any errors in our output.

For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define \delta^{(3)}_j (since layer 3 is the output layer). For the hidden units, we can compute \delta^{(l)}_j based on a weighted average of the error terms of the nodes in layer (l + 1).

Procedure

In detail, here is the backpropagation algorithm. We should implement steps 1 to 4 in a loop that processes one example at a time. Concretely, we should implement a for-loop for t=1:m and place steps 1-4 below inside the for-loop, with the t-th iteration performing the calculation on the t-th training example (x(t), y(t)). Step 5 will divide the accumulated gradients by m to obtain the gradients for the neural network cost function.

  1. Set the input layer’s values (a(1)) to the t-th training example x(t). Perform a feedforward pass (Figure ??), computing the activations (z(2); a(2); z(3); a(3)) for layers 2 and 3. Note that we need to add a+1 term to ensure that the vectors of activations for layers a(1) and a(2) also include the bias unit.
  2. For each output unit k in layer 3 (the output layer), set

        \[\delta_k^{(3)} = (a^{(3)}_k - y_k)\]

    where y_k \in [0,1] indicates whether the current training example belongs to class k (y_k = 1, k = 1,2), or if it belongs to a different class (y_k = 0).
  3. For hidden layer l=2, set

        \[\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)}).\]

  4. Accumulate the gradient from this example using the following formula. Note that we should skip or remove \delta^{(2)}_0.

        \[\Delta^{(l)} = \Delta^{(l)} + \delta^{(l)}(a^{(l)})^T\]

  5. Obtain the (unregularized) gradient for the neural network cost function by dividing the accumulated gradients by m:

        \[\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}\]

To account for regularization, it turns out that we can add this as an additional term after computing the gradients using backpropagation. Specifically, after we have computed \Delta^{(l)}_{ij} using backpropagation, we should add regularization using

    \[\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)},\quad \mbox{for} j = 0.\]


    \[\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}, \quad \mbox{for} j \geq 1.\]

After we have successfully implemented the neural network cost function and gradient computation by feedforward propagation and backpropagation, the next step will be learning a good set of parameters by minimizing the cost function. Since the cost function is not convex, there is no guarantee that we can always find the global minimum. But we should try to increase the number of iterations in our minimizer(say gradient descent, or conjugate descent). And perform the solver several times since the initialization is random, and different initialization may results in different local minimum.