This post contains my study notes for Andrew Ng’s course. https://www.andrewng.org/courses/
Here we use a three-layer network: an input layer, a hidden layer, and an output layer.
Let $m$ be the number of inputs (images in our case), and $K$ be the number of possible labels. The cost function for the neural network (without regularization) is

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) - (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right]$$
To avoid over-fitting, we use the cost function for neural networks with regularization:

$$J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) - (1 - y_k^{(i)}) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l} \sum_{j} \sum_{k} \big(\Theta_{j,k}^{(l)}\big)^2,$$

where the regularization sum runs over all weights except those multiplying the bias units.
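As a quick illustration, here is a minimal NumPy sketch of this regularized cost for the three-layer network (the function and variable names are my own, and I assume `Theta1` and `Theta2` store the bias weights in their first column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized cost for a 3-layer network.
    X: (m, n) inputs; Y: (m, K) one-hot labels;
    Theta1, Theta2: weight matrices with the bias weights in column 0."""
    m = X.shape[0]
    # Forward pass: prepend the bias unit at each layer.
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ Theta1.T)])
    h = sigmoid(a2 @ Theta2.T)                    # (m, K) hypothesis
    # Unregularized cross-entropy cost, summed over examples and labels.
    J = np.sum(-Y * np.log(h) - (1 - Y) * np.log(1 - h)) / m
    # Regularization: sum of squared weights, skipping the bias column.
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    return J
```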
When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{(l)}$ uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$. We use $\epsilon_{init} = 0.12$. This range of values ensures that the parameters are kept small and makes the learning more efficient.
One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice is $\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$, where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the numbers of units in the layers adjacent to $\Theta^{(l)}$.
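A one-function sketch of this initialization (names are mine; the returned matrix has shape $(L_{out},\, 1 + L_{in})$, the extra column holding the bias weights):

```python
import numpy as np

def rand_init(L_in, L_out, eps_init=None):
    """Uniform random weights in [-eps_init, eps_init] for a layer with
    L_in inputs and L_out outputs; the +1 column is for the bias weights."""
    if eps_init is None:
        eps_init = np.sqrt(6) / np.sqrt(L_in + L_out)  # default choice above
    return np.random.uniform(-eps_init, eps_init, size=(L_out, 1 + L_in))
```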
The gradient of the neural network cost function is obtained by the backpropagation algorithm. The intuition behind backpropagation is as follows. Given a training example $(x^{(t)}, y^{(t)})$, we first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_\Theta(x)$. Then, for each node $j$ in layer $l$, we would like to compute an error term $\delta_j^{(l)}$ that measures how much that node was responsible for any errors in our output.
For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). For the hidden units, we compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l+1)$.
Procedure
In detail, here is the backpropagation algorithm. We implement steps 1 to 4 in a loop that processes one example at a time. Concretely, we implement a for-loop for $t = 1, \dots, m$ and place steps 1-4 below inside the for-loop, with the $t$-th iteration performing the calculation on the $t$-th training example $(x^{(t)}, y^{(t)})$. Step 5 then averages the accumulated gradients; a code sketch of the whole loop follows the list.
- Set the input layer’s values $a^{(1)}$ to the $t$-th training example $x^{(t)}$. Perform a feedforward pass, computing the activations $z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)}$ for layers 2 and 3. Note that we need to add a $+1$ term to ensure that the vectors of activations $a^{(1)}$ and $a^{(2)}$ also include the bias unit.
- For each output unit $k$ in layer 3 (the output layer), set $\delta_k^{(3)} = a_k^{(3)} - y_k$, where $y_k \in \{0, 1\}$ indicates whether the current training example belongs to class $k$.
- For the hidden layer $l = 2$, set $\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \odot g'(z^{(2)})$, where $\odot$ denotes element-wise multiplication and $g'$ is the sigmoid gradient.
- Accumulate the gradient from this example using $\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$. Note that we should skip or remove $\delta_0^{(2)}$, the error term for the bias unit.
- Obtain the (unregularized) gradient for the neural network cost function by dividing the accumulated gradients by $m$: $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$.
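Putting steps 1-5 together, here is a sketch of the loop in NumPy (my own naming; step 3 skips $\delta_0^{(2)}$ by dropping the bias column of $\Theta^{(2)}$, which is equivalent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_gradient(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(Theta1, Theta2, X, Y):
    """Unregularized gradients D1, D2 via steps 1-5 above."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):
        # Step 1: feedforward pass, prepending the bias unit (+1 term).
        a1 = np.concatenate(([1.0], X[t]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # Step 2: output-layer error term.
        d3 = a3 - Y[t]
        # Step 3: hidden-layer error term (bias column of Theta2 dropped,
        # which skips delta_0).
        d2 = (Theta2[:, 1:].T @ d3) * sigmoid_gradient(z2)
        # Step 4: accumulate the gradients.
        Delta1 += np.outer(d2, a1)
        Delta2 += np.outer(d3, a2)
    # Step 5: divide the accumulated gradients by m.
    return Delta1 / m, Delta2 / m
```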
To account for regularization, it turns out that we can add it as an additional term after computing the gradients using backpropagation. Specifically, after we have computed $\Delta_{ij}^{(l)}$ using backpropagation, we add regularization using

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} \quad \text{for } j = 0$$

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} \quad \text{for } j \ge 1$$

Note that the bias weights ($j = 0$) are not regularized.
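In code, this is one extra step after `backprop` (again a sketch with my own names; only the non-bias columns, $j \ge 1$, are touched):

```python
def regularized_gradients(D1, D2, Theta1, Theta2, lam, m):
    """Add (lambda/m) * Theta to the unregularized gradients,
    leaving the bias column (j = 0) unregularized."""
    D1 = D1.copy()
    D2 = D2.copy()
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```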
After we have successfully implemented the neural network cost function and gradient computation by feedforward propagation and backpropagation, the next step is to learn a good set of parameters by minimizing the cost function. Since the cost function is non-convex, there is no guarantee that we will find the global minimum. We should therefore try increasing the number of iterations in our minimizer (say, gradient descent or conjugate gradient), and run the solver several times: since the initialization is random, different initializations may converge to different local optima.
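One way to act on this, sketched below with SciPy’s conjugate-gradient minimizer: restart the optimization from several random initializations and keep the best run. `cost_and_grad` and `init_fn` are assumed helpers (a function returning `(J, grad)` for a flattened parameter vector, and a fresh random flattened initialization, respectively), not part of the course code:

```python
from scipy.optimize import minimize

def train(cost_and_grad, init_fn, n_restarts=5, maxiter=400):
    """Minimize a non-convex cost from several random starting points
    and keep the result with the lowest final cost."""
    best = None
    for _ in range(n_restarts):
        res = minimize(cost_and_grad, init_fn(), jac=True,
                       method="CG", options={"maxiter": maxiter})
        if best is None or res.fun < best.fun:
            best = res
    return best.x  # flattened parameters of the best run
```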