Sisi – Page 4 – Sisi Tang

March 8, 2019 | March 8, 2021

European and American Put-Call Parity

We assume that the underlying asset pays no dividends and that the risk-free interest rate $r$ is positive.

European put-call parity

A European put and a European call option with the same maturity $T$ and strike $K$ satisfy the put-call parity:

$C_E - P_E = S_0 - Ke^{-rT},$

where $C_E$ is the price of the European call option, $P_E$ is the price of the European put option, $S_0$ is the price of the underlying asset at time $t=0$, and $r$ is the risk-free interest rate.

The position $C_E - P_E$ (long call, short put) can be seen as a forward contract with maturity $T$ and strike $K$. A short proof of European put-call parity is as follows:

$(S_T-K)^+ - (K-S_T)^+ = S_T - K$

That is to say, the terminal payoff of a long call plus a short put is equal to that of a forward (with the same maturity $T$ and strike $K$). Hence,

$\begin{aligned}& P(0,T)E[(S_T-K)^+] - P(0,T)E[(K-S_T)^+] \\ & = P(0,T)E[S_T - K],\end{aligned}$

where $P(0,T) = e^{-rT}$ is the discount factor from $T$ to $0$, and $E$ is the expectation under the risk-neutral measure. Since the discounted asset price is a martingale under this measure, $P(0,T)E[S_T] = S_0$, so the above equation is equivalent to the European put-call parity formula.
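
As a quick numerical sanity check of the parity formula, here is a minimal Python sketch. It assumes Black-Scholes prices purely for illustration (the argument above does not depend on any particular model). The function names `norm_cdf`, `bs_call_put`, and `parity_gap` and the sample parameters are hypothetical choices, not part of the original notes.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_put(s0, k, r, sigma, t):
    """Black-Scholes prices (call, put) for an asset paying no dividends.

    Assumption: the Black-Scholes model is used only to produce example prices.
    """
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    call = s0 * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)
    put = k * math.exp(-r * t) * norm_cdf(-d2) - s0 * norm_cdf(-d1)
    return call, put


def parity_gap(s0, k, r, sigma, t):
    """Return (C_E - P_E) - (S_0 - K e^{-rT}); parity says this is zero."""
    call, put = bs_call_put(s0, k, r, sigma, t)
    return (call - put) - (s0 - k * math.exp(-r * t))


if __name__ == "__main__":
    # Illustrative parameters; the gap should vanish up to rounding error.
    print(parity_gap(s0=100.0, k=95.0, r=0.03, sigma=0.25, t=1.5))
```

The gap is zero up to floating-point rounding for any choice of volatility, which reflects the fact that parity does not depend on the model.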

Never prematurely exercise American call option

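If we wait until maturity, the payoff of the call option is $(S_T-K)^+$. If we instead exercise the option at time $t < T$, then we have a cash position of $-K$ and a stock worth $S_t$ at that time. At time $T$, the total portfolio value would then be $S_T-Ke^{r(T-t)}$, which is smaller than $S_T - K \leq (S_T-K)^+$ because $r > 0$. Early exercise is therefore never optimal, and the American call is worth the same as the European call. That is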

$C_E = C_A$

But if the underlying asset pays a dividend, then it might be optimal to exercise the call option early.

American put-call parity

American put and call options with the same maturity $T$ and strike $K$ satisfy the following inequality:

$S_0 - K \leq C_A-P_A \leq S_0 - Ke^{-rT}$

For the first inequality,

Suppose that at time 0 we hold the following portfolio: long a call option, short a put option, short the underlying asset, and $K$ in cash.

At $t=0$ , the portfolio value is $C_A - P_A - S_0 + K$ .

If the holder of the put option decides to exercise it, we exercise our call option at the same time; otherwise, we wait until maturity to decide whether to exercise. With this strategy, the call and the put have the same exercise time and hence can be seen as a forward with undetermined maturity. Say the maturity is $t \in [0,T]$.
Then, at time $t$, the value of our portfolio is $S_t-K$ for the option part and $-S_t + Ke^{rt}$ for the asset and cash part. The total value of the portfolio is $Ke^{rt} - K \geq 0$.

By the no-arbitrage principle, we have

$C_A - P_A - S_0 + K \geq 0$

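For the second inequality, we use $C_A = C_E$ (shown above) and $P_A \geq P_E$ (an American option is worth at least as much as its European counterpart):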

$\begin{aligned} &~ C_A - P_A \\ = &~ C_E - P_E + P_E - P_A \\ = &~ S_0 - K e^{-rT} + P_E - P_A \\ \leq &~ S_0 - Ke^{-rT}\end{aligned}$
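
Both inequalities can be checked numerically. The sketch below prices the options on a Cox-Ross-Rubinstein binomial tree, which is an assumption made only for illustration; the notes themselves do not fix a pricing model. The names `crr_price` and `american_parity_bounds` and the sample parameters are hypothetical.

```python
import math


def crr_price(s0, k, r, sigma, t, steps, is_call, american):
    """Price an option on a CRR binomial tree (no dividends).

    Assumption: the tree is just one convenient arbitrage-free model.
    """
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    disc = math.exp(-r * dt)
    p = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability

    def payoff(s):
        return max(s - k, 0.0) if is_call else max(k - s, 0.0)

    # Values at maturity; node j has j up-moves out of `steps`.
    values = [payoff(s0 * u ** j * d ** (steps - j)) for j in range(steps + 1)]

    # Roll back through the tree, allowing early exercise if `american`.
    for n in range(steps - 1, -1, -1):
        for j in range(n + 1):
            cont = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            if american:
                cont = max(cont, payoff(s0 * u ** j * d ** (n - j)))
            values[j] = cont
    return values[0]


def american_parity_bounds(s0, k, r, sigma, t, steps=200):
    """Return (S_0 - K, C_A - P_A, S_0 - K e^{-rT})."""
    c_a = crr_price(s0, k, r, sigma, t, steps, is_call=True, american=True)
    p_a = crr_price(s0, k, r, sigma, t, steps, is_call=False, american=True)
    return s0 - k, c_a - p_a, s0 - k * math.exp(-r * t)


if __name__ == "__main__":
    lower, diff, upper = american_parity_bounds(100.0, 100.0, 0.05, 0.2, 1.0)
    print(lower, diff, upper)
```

On the same tree, pricing the call with `american=True` and `american=False` gives the same value, consistent with $C_E = C_A$ when there are no dividends.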

March 8, 2019 | March 8, 2021

European call option vs. Asian call option

Let a stock follow a Geometric Brownian motion

$\textnormal{d}S_t = S_t(r \textnormal{d} t + \sigma \textnormal{d} W_t)$

with constant $r,\sigma > 0$ . Let

$A_T = \frac{1}{T}\int_0^T S_u \textnormal{d} u$

A European-style Asian call option has the payoff $(A_T-K)^+$ and a European vanilla call option has the payoff $(S_T-K)^+$ . Then the European vanilla option has the higher value.

Proof.

By (1) Jensen’s inequality, (2) Fubini’s theorem, and (3) the fact that the longer the maturity, the higher the value of a vanilla European call option, together with $e^{-rT} \leq e^{-ru}$ for $u \leq T$, we have

$\begin{aligned} &~ \mathbb{E}[e^{-rT}(A_T-K)^+] \\= &~ \mathbb{E}[e^{-rT}(\frac{1}{T}\int_0^T S_u \textnormal{d} u -K)^+] \\\leq & ~ \mathbb{E}[e^{-rT}\frac{1}{T}\int_0^T (S_u-K)^+ \textnormal{d} u] \\= &~ \frac{1}{T}\int_0^T \mathbb{E}[e^{-rT} (S_u-K)^+ ] \textnormal{d} u \\\leq &~ \frac{1}{T}\int_0^T \mathbb{E}[e^{-ru} (S_u-K)^+ ] \textnormal{d} u \\\leq &~ \frac{1}{T}\int_0^T \mathbb{E}[e^{-rT} (S_T-K)^+ ] \textnormal{d} u \\ = &~ \mathbb{E}[e^{-rT} (S_T-K)^+ ] \\ \end{aligned}$

Notice that we didn’t use the dynamics of geometric Brownian motion. The above deduction holds as long as (3) holds and $r \geq 0$.
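
The inequality can also be observed in a simulation. Below is a minimal Monte Carlo sketch under the geometric Brownian motion above. The time discretisation, the trapezoidal approximation of $A_T$, the function name `simulate_prices`, and the sample parameters are all assumptions made for illustration.

```python
import math
import random


def simulate_prices(s0, k, r, sigma, t, n_steps, n_paths, seed=0):
    """Monte Carlo estimates of (Asian call, European call) prices under GBM.

    Assumption: A_T is approximated by the trapezoidal rule on a uniform grid.
    """
    rng = random.Random(seed)
    dt = t / n_steps
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    asian_sum = 0.0
    euro_sum = 0.0
    for _path in range(n_paths):
        s = s0
        integral = 0.0
        for _step in range(n_steps):
            # Exact GBM step over one time increment.
            s_next = s * math.exp(drift + vol * rng.gauss(0.0, 1.0))
            integral += 0.5 * (s + s_next) * dt
            s = s_next
        a_t = integral / t  # time average of the path
        asian_sum += max(a_t - k, 0.0)
        euro_sum += max(s - k, 0.0)
    disc = math.exp(-r * t)
    return disc * asian_sum / n_paths, disc * euro_sum / n_paths


if __name__ == "__main__":
    asian, euro = simulate_prices(100.0, 100.0, 0.05, 0.2, 1.0, 50, 5000)
    print("Asian call:", asian, "European call:", euro)
```

Both options are evaluated on the same simulated paths, so the comparison is not blurred by independent sampling noise.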

February 12, 2019 | August 3, 2019

Neural Network

This post contains my study notes on Andrew Ng’s course. https://www.andrewng.org/courses/

A neural network is a composition of several logistic regressions, which allows some dependence between those regressions. A neural network has more coefficients to learn from the data, so it can fit more complex patterns than a single logistic regression. The main thing we need to watch out for is over-fitting.

Here we use a neural network with 3 layers (an input layer, a hidden layer, and an output layer) as an example for background information. The case of more layers is quite similar. In this article, our inputs are images of $25 \times 25$ pixels, which gives us $625$ input layer units (not counting the extra bias unit). The training data will be loaded into the variables $X$ and $y$ , where $X$ holds the images and $y$ holds the labels.

Let $m$ be the number of inputs (images in our case), and $K$ be the number of possible labels. The cost function for the neural network (without regularization) is

$\begin{aligned}J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^K &[-y_k^{(i)}\log((h_{\theta}(x^{(i)}))_k) \\ &- (1-y_k^{(i)}) \log(1-(h_{\theta}(x^{(i)}))_k) ]\end{aligned}$

To avoid over-fitting, we use the cost function for neural networks with regularization

$\begin{aligned}J(\theta) = &\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^K [-y_k^{(i)}\log((h_{\theta}(x^{(i)}))_k) \\ &- (1-y_k^{(i)}) \log(1-(h_{\theta}(x^{(i)}))_k) ] \\ &+\frac{\lambda}{2m} [ \sum_{j=1}^{25} \sum_{k=1}^{625} (\Theta_{j,k}^{(1)})^2 + \sum_{j=1}^{2}\sum_{k=1}^{25} (\Theta_{j,k}^{(2)})^2 ]\end{aligned}$
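
The cost function can be sketched in plain Python as follows. This is a minimal illustration rather than the course's reference implementation: weight matrices are stored as lists of rows with the bias weight in column 0, labels are one-hot vectors, and the names `sigmoid`, `layer`, and `cost` are hypothetical. Setting `lam=0` gives the cost without regularization.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, a):
    """Apply one layer: prepend the bias unit, then take sigmoid of each row sum."""
    a_bias = [1.0] + list(a)
    return [sigmoid(sum(w * v for w, v in zip(row, a_bias))) for row in theta]


def cost(theta1, theta2, xs, ys, lam=0.0):
    """Cross-entropy cost of a 3-layer network with optional regularization.

    Assumption: ys holds one-hot label vectors of length K.
    """
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = layer(theta2, layer(theta1, x))  # network output h_theta(x)
        total += sum(
            -yk * math.log(hk) - (1.0 - yk) * math.log(1.0 - hk)
            for yk, hk in zip(y, h)
        )
    # Column 0 of each matrix multiplies the bias unit and is not penalized.
    penalty = sum(
        w * w for theta in (theta1, theta2) for row in theta for w in row[1:]
    )
    return total / m + lam / (2.0 * m) * penalty
```

With all weights equal to zero, every output is $0.5$, so the unregularized cost is $K \log 2$, which makes a convenient check.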

When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{(l)}$ uniformly in the range $[-\epsilon,\epsilon]$ . We use $\epsilon=0.12$ . This range of values ensures that the parameters are kept small and makes the learning more efficient.

One effective strategy for choosing $\epsilon$ is to base it on the number of units in the network. A good choice of $\epsilon$ is $\epsilon = \frac{\sqrt{6}}{\sqrt{L_{in}+L_{out}}}$ , where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the number of units in the layers adjacent to $\Theta^{(l)}$ .
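
A minimal sketch of this initialization in Python is shown below. The function names `init_epsilon` and `rand_init_weights` and the list-of-rows layout (one extra column for the bias weight) are assumptions made for illustration.

```python
import math
import random


def init_epsilon(l_in, l_out):
    """epsilon = sqrt(6) / sqrt(L_in + L_out)."""
    return math.sqrt(6.0) / math.sqrt(l_in + l_out)


def rand_init_weights(l_in, l_out, rng=None):
    """Return an l_out x (l_in + 1) matrix with entries uniform in [-eps, eps].

    The extra column holds the weights of the bias unit.
    """
    rng = rng or random.Random()
    eps = init_epsilon(l_in, l_out)
    return [[rng.uniform(-eps, eps) for _ in range(l_in + 1)] for _ in range(l_out)]


if __name__ == "__main__":
    # Layer sizes from this article: 625 input units, 25 hidden units.
    print(init_epsilon(625, 25))
    theta1 = rand_init_weights(625, 25, random.Random(0))
    print(len(theta1), len(theta1[0]))
```

For the layer sizes used here ($625$ inputs and $25$ hidden units) the formula gives $\epsilon$ of roughly $0.096$, the same order of magnitude as the value $0.12$ used above.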

The error of the neural network is obtained by the backpropagation algorithm. The intuition behind the backpropagation algorithm is as follows. Given a training example $(x^{(t)}, y^{(t)})$ , we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_{\Theta}(x)$ . Then, for each node $j$ in layer $l$ , we would like to compute an error term $\delta_j^{(l)}$ that measures how much that node was responsible for any errors in our output.

For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta^{(3)}_j$ (since layer 3 is the output layer). For the hidden units, we can compute $\delta^{(l)}_j$ based on a weighted average of the error terms of the nodes in layer $(l + 1)$ .

Procedure

In detail, here is the backpropagation algorithm. We should implement steps 1 to 4 in a loop that processes one example at a time. Concretely, we should implement a for-loop for $t=1:m$ and place steps 1-4 below inside the for-loop, with the t-th iteration performing the calculation on the t-th training example $(x^{(t)}, y^{(t)})$ . Step 5 will divide the accumulated gradients by $m$ to obtain the gradients for the neural network cost function.

1. Set the input layer’s values $a^{(1)}$ to the t-th training example $x^{(t)}$ . Perform a feedforward pass (Figure ??), computing the activations $(z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)})$ for layers 2 and 3. Note that we need to add a $+1$ term to ensure that the vectors of activations for layers $a^{(1)}$ and $a^{(2)}$ also include the bias unit.
2. For each output unit $k$ in layer 3 (the output layer), set
$\delta_k^{(3)} = (a^{(3)}_k - y_k)$
where $y_k \in \{0,1\}$ indicates whether the current training example belongs to class $k$ ( $y_k = 1$ ) or to a different class ( $y_k = 0$ ), for $k = 1,2$ .
3. For hidden layer $l=2$ , set
$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)}).$
4. Accumulate the gradient from this example using the following formula. Note that we should skip or remove $\delta^{(2)}_0$ .
$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$
5. Obtain the (unregularized) gradient for the neural network cost function by dividing the accumulated gradients by $m$ :
$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$

To account for regularization, it turns out that we can add this as an additional term after computing the gradients using backpropagation. Specifically, after we have computed $\Delta^{(l)}_{ij}$ using backpropagation, we should add regularization using

$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)},\quad \mbox{for } j = 0.$

$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}, \quad \mbox{for } j \geq 1.$
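
The backpropagation steps and the regularized gradient can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the course's reference code: weights are lists of rows with the bias weight in column 0, labels are one-hot vectors, and all function names are hypothetical. The `cost` function and `max_gradient_error` are included only to compare the result against central finite differences.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(theta, a):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, a)) for row in theta]


def forward(theta1, theta2, x):
    """Step 1: feedforward pass; a1 and a2 get a leading bias unit."""
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(z) for z in matvec(theta1, a1)]
    a3 = [sigmoid(z) for z in matvec(theta2, a2)]
    return a1, a2, a3


def cost(theta1, theta2, xs, ys, lam=0.0):
    """Regularized cross-entropy cost, used here only for gradient checking."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = forward(theta1, theta2, x)[2]
        total += sum(-yk * math.log(hk) - (1.0 - yk) * math.log(1.0 - hk)
                     for yk, hk in zip(y, h))
    penalty = sum(w * w for th in (theta1, theta2) for row in th for w in row[1:])
    return total / m + lam / (2.0 * m) * penalty


def backprop(theta1, theta2, xs, ys, lam=0.0):
    """Return the gradients (D1, D2) of the regularized cost."""
    m = len(xs)
    acc1 = [[0.0] * len(row) for row in theta1]  # accumulator for layer 1
    acc2 = [[0.0] * len(row) for row in theta2]  # accumulator for layer 2
    for x, y in zip(xs, ys):
        a1, a2, a3 = forward(theta1, theta2, x)
        # Step 2: error of the output layer.
        d3 = [a - t for a, t in zip(a3, y)]
        # Step 3: error of the hidden layer, using g'(z) = a (1 - a);
        # the loop starts at 1, so the bias entry is skipped.
        d2 = [sum(theta2[k][j] * d3[k] for k in range(len(d3)))
              * a2[j] * (1.0 - a2[j]) for j in range(1, len(a2))]
        # Step 4: accumulate outer products of errors and activations.
        for i, di in enumerate(d2):
            for j, aj in enumerate(a1):
                acc1[i][j] += di * aj
        for i, di in enumerate(d3):
            for j, aj in enumerate(a2):
                acc2[i][j] += di * aj

    def finish(acc, theta):
        # Step 5 plus regularization: the bias column (j = 0) is not penalized.
        return [[acc[i][j] / m + (lam / m * theta[i][j] if j >= 1 else 0.0)
                 for j in range(len(theta[i]))] for i in range(len(theta))]

    return finish(acc1, theta1), finish(acc2, theta2)


def max_gradient_error(theta1, theta2, xs, ys, lam=0.0, h=1e-5):
    """Largest gap between backprop and central finite differences."""
    grads = backprop(theta1, theta2, xs, ys, lam)
    worst = 0.0
    for theta, grad in zip((theta1, theta2), grads):
        for i in range(len(theta)):
            for j in range(len(theta[i])):
                old = theta[i][j]
                theta[i][j] = old + h
                up = cost(theta1, theta2, xs, ys, lam)
                theta[i][j] = old - h
                down = cost(theta1, theta2, xs, ys, lam)
                theta[i][j] = old  # restore the original weight
                worst = max(worst, abs((up - down) / (2.0 * h) - grad[i][j]))
    return worst
```

Comparing against finite differences on a tiny network is a cheap way to catch indexing mistakes, such as forgetting to drop the bias entry of $\delta^{(2)}$, before training on the full $625$-unit input layer.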

After we have successfully implemented the neural network cost function and gradient computation by feedforward propagation and backpropagation, the next step is to learn a good set of parameters by minimizing the cost function. Since the cost function is not convex, there is no guarantee that we will find the global minimum. We should still try increasing the number of iterations in our minimizer (say, gradient descent or conjugate gradient), and run the solver several times: since the initialization is random, different initializations may result in different local minima.