#let dfrac(a, b) = $display(frac(#a, #b))$
= Problem 1a
Given:
#let ww = $bold(w)$
#let xx = $bold(x)$
#let vv = $bold(v)$
#let XX = $bold(X)$
- $E(ww_1,ww_2,vv|XX) = - sum_t [r^t log y^t + (1 - r^t) log(1 - y^t)]$
- $y^t = "sigmoid"(v_2 z^t_2 + v_1 z^t_1 + v_0)$
- $z^t_1 = "ReLU"(w_(1,2)x^t_2 + w_(1,1)x^t_1 + w_(1,0))$
- $z^t_2 = tanh(w_(2,2)x^t_2 + w_(2,1)x^t_1 + w_(2,0))$
Using the convention $x_(j=1..D)$, $y_(i=1..K)$, and $z_(h=1..H)$.
Solved as:
- $
frac(diff E, diff v_h) &= - sum_t frac(diff E, diff y^t) frac(diff y^t, diff v_h) \
&= - sum_t (r^t dot frac(1, y^t) - (1-r^t) dot frac(1, 1-y^t)) (y^t z^t_h (1-y^t)) \
&= - sum_t (frac(r^t, y^t) - frac(1-r^t, 1-y^t)) (y^t z^t_h (1-y^t)) \
&= - sum_t (frac(r^t (1-y^t)-y^t (1-r^t), cancel(y^t) (1-y^t))) (cancel(y^t) z^t_h (1-y^t)) \
&= - sum_t (frac(r^t - y^t, cancel(1-y^t))) (z^t_h cancel((1-y^t))) \
&= - sum_t (r^t - y^t) z^t_h \
$
- $
frac(diff E, diff w_(1,j)) &= - sum_t frac(diff E, diff y^t) frac(diff y^t, diff z^t_1) frac(diff z^t_1, diff w_(1,j)) \
&= - sum_t (frac(r^t, y^t) - frac(1-r^t, 1-y^t)) (y^t (1-y^t) v_1) (x^t_j cases(0 "if" ww_1 dot xx^t <0, 1 "otherwise")) \
&= - sum_t (r^t - y^t) v_1 x^t_j cases(0 "if" ww_1 dot xx^t <0, 1 "otherwise") \
$
- $
frac(diff E, diff w_(2,j)) &= - sum_t frac(diff E, diff y^t) frac(diff y^t, diff z^t_2) frac(diff z^t_2, diff w_(2,j)) \
&= - sum_t (r^t - y^t) v_2 x^t_j (1-tanh^2(ww_2 dot xx^t)) \
$
Updates:
- $Delta v_h = eta sum_t (r^t-y^t) z^t_h$
- $Delta w_(1,j) = eta sum_t (r^t - y^t) v_1 x^t_j cases(0 "if" ww_1 dot xx^t <0, 1 "otherwise")$
- $Delta w_(2,j) = eta sum_t (r^t - y^t) v_2 x^t_j (1-tanh^2(ww_2 dot xx^t))$
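As a sanity check, here is a minimal NumPy sketch of one batch update using exactly these rules. All names, shapes, and the learning rate `eta` are my own assumptions, not the assignment's starter code:
```python
import numpy as np

def batch_update_1a(X, r, W, v, eta=0.01):
    """X: (N, 2) inputs, r: (N,) labels in {0, 1},
    W: (2, 3) hidden weights with rows [w_(h,0), w_(h,1), w_(h,2)],
    v: (3,) output weights [v_0, v_1, v_2]."""
    Xb = np.hstack([np.ones((len(X), 1)), X])         # prepend bias column
    a = Xb @ W.T                                      # pre-activations, (N, 2)
    z1 = np.maximum(a[:, 0], 0)                       # ReLU hidden unit
    z2 = np.tanh(a[:, 1])                             # tanh hidden unit
    Zb = np.column_stack([np.ones_like(z1), z1, z2])  # (N, 3), bias first
    y = 1.0 / (1.0 + np.exp(-(Zb @ v)))               # sigmoid output
    delta = r - y                                     # (r^t - y^t)
    dv = delta @ Zb                                   # Delta v_h / eta
    dW1 = (delta * v[1] * (a[:, 0] >= 0)) @ Xb        # Delta w_(1,j) / eta
    dW2 = (delta * v[2] * (1.0 - z2**2)) @ Xb         # Delta w_(2,j) / eta
    return W + eta * np.vstack([dW1, dW2]), v + eta * dv
```
Note that all gradients are computed from the old parameters before any update is applied, matching batch gradient descent.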
= Problem 1b
- $E(ww,vv|XX) = - sum_t [r^t log y^t + (1 - r^t) log (1 - y^t)]$
- $y^t = "sigmoid"(v_2 z^t_2 + v_1 z^t_1 + v_0)$
- $z^t_1 = "ReLU"(w_2 x^t_2 + w_1 x^t_1 + w_0)$
- $z^t_2 = tanh(w_2 x^t_2 + w_1 x^t_1 + w_0)$
Updates:
- Same as above: $Delta v_h = eta sum_t (r^t-y^t) z^t_h$
- $
frac(diff E, diff w_j) &= - sum_t [frac(diff E, diff y^t) frac(diff y^t, diff z^t_1) frac(diff z^t_1, diff w_j) + frac(diff E, diff y^t) frac(diff y^t, diff z^t_2) frac(diff z^t_2, diff w_j)] \
&= - sum_t frac(diff E, diff y^t) (frac(diff y^t, diff z^t_1) frac(diff z^t_1, diff w_j) + frac(diff y^t, diff z^t_2) frac(diff z^t_2, diff w_j)) \
&= - sum_t (frac(r^t, y^t) - frac(1-r^t, 1-y^t)) (frac(diff y^t, diff z^t_1) frac(diff z^t_1, diff w_j) + frac(diff y^t, diff z^t_2) frac(diff z^t_2, diff w_j)) \
&= - sum_t (frac(r^t-y^t, y^t (1-y^t))) (frac(diff y^t, diff z^t_1) frac(diff z^t_1, diff w_j) + frac(diff y^t, diff z^t_2) frac(diff z^t_2, diff w_j)) \
&= - sum_t (frac(r^t-y^t, y^t (1-y^t))) (y^t (1-y^t) v_1 frac(diff z^t_1, diff w_j) + y^t (1-y^t) v_2 frac(diff z^t_2, diff w_j)) \
&= - sum_t (r^t-y^t) (v_1 frac(diff z^t_1, diff w_j) + v_2 frac(diff z^t_2, diff w_j)) \
&= - sum_t (r^t-y^t) (x^t_j v_1 cases(0 "if" ww dot xx^t < 0, 1 "otherwise") + x^t_j v_2 (1 - tanh^2 (ww dot xx^t))) \
&= - sum_t (r^t-y^t) x^t_j (v_1 cases(0 "if" ww dot xx^t < 0, 1 "otherwise") + v_2 (1 - tanh^2 (ww dot xx^t))) \
$
so $Delta w_j = eta sum_t (r^t-y^t) x^t_j (v_1 cases(0 "if" ww dot xx^t < 0, 1 "otherwise") + v_2 (1 - tanh^2 (ww dot xx^t)))$.
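Because both hidden units now share one weight vector, the two backprop paths add before multiplying by $x^t_j$. A hypothetical NumPy sketch of this combined update (names and shapes assumed, matching the sketch in Problem 1a):
```python
import numpy as np

def batch_update_1b(X, r, w, v, eta=0.01):
    """X: (N, 2) inputs, r: (N,) labels, w: (3,) shared hidden weights,
    v: (3,) output weights [v_0, v_1, v_2]."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    a = Xb @ w                            # one pre-activation feeds both units
    z1, z2 = np.maximum(a, 0), np.tanh(a)
    y = 1.0 / (1.0 + np.exp(-(v[0] + v[1] * z1 + v[2] * z2)))
    delta = r - y
    # ReLU path + tanh path, as in the last line of the derivation
    dpath = v[1] * (a >= 0) + v[2] * (1.0 - z2**2)
    return w + eta * ((delta * dpath) @ Xb)   # w + Delta w_j
```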
#pagebreak()
= Problem 2a + 2b
For this problem I see a gentle increase in the likelihood value after each of
the E + M steps. There is an issue with the first step, but I don't know what
causes it.
In general, the higher $k$ was, the more colors were available and the better
the resulting color mapping. The last run, $k = 12$, had the best "resolution"
(not resolution in the literal sense, since the pixel density didn't change,
but the shapes are more detailed).
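Since each full E + M step should never decrease the log-likelihood, one way to localize the first-step issue is to scan the recorded values for drops. The `log_likelihoods` argument below is an assumed per-iteration record, not something the assignment provides:
```python
import numpy as np

def find_likelihood_drops(log_likelihoods, tol=1e-8):
    """Return indices i where the log-likelihood fell from step i to i+1."""
    ll = np.asarray(log_likelihoods, dtype=float)
    drops = np.flatnonzero(np.diff(ll) < -tol)
    for i in drops:
        print(f"step {i} -> {i + 1}: {ll[i]:.6f} dropped to {ll[i + 1]:.6f}")
    return drops
```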
#image("2a.png")
#pagebreak()
= Problem 2c
For this version, k-means performed a lot better than my initial EM step, even
with $k = 7$. I suspect that between EM steps the cluster assignments of the
data shift around, spreading out inaccurate values, while k-means always
operates on the original data.
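A minimal sketch of how the k-means output could seed the GMM parameters; the array names (`pixels`, `labels`) and the small ridge constant are my own assumptions:
```python
import numpy as np

def init_from_kmeans(pixels, labels, k):
    """pixels: (N, d) data, labels: (N,) k-means cluster assignments."""
    N, d = pixels.shape
    pis, means = np.zeros(k), np.zeros((k, d))
    covs = np.zeros((k, d, d))
    for i in range(k):
        members = pixels[labels == i]
        pis[i] = len(members) / N           # mixing proportion
        means[i] = members.mean(axis=0)     # component mean
        diff = members - means[i]
        # small ridge keeps the covariance invertible for the first E step
        covs[i] = diff.T @ diff / len(members) + 1e-6 * np.eye(d)
    return pis, means, covs
```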
#image("2c.png")
#pagebreak()
= Problem 2d
For the $Sigma$ update step, I added the term shown in red:
#let rtext(t) = {
set text(red)
t
}
$
Sigma_i &= frac(1, N_i) sum_(t=1)^N gamma(z^t_i) (x^t - mu_i) (x^t - mu_i)^T
rtext(- frac(lambda, 2) sum_(i=1)^k sum_(j=1)^d (Sigma^(-1)_i)_("jj"))
$
The overall maximum-likelihood solution could not be derived in closed form
because of the difficulty of taking the logarithm of a sum.
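Read literally, the red term is a single scalar, $frac(lambda, 2) sum_(i=1)^k sum_(j=1)^d (Sigma^(-1)_i)_("jj")$, i.e. a sum of traces of inverse covariances, subtracted from the empirical covariance. A hypothetical sketch of that update as written, using the previous iteration's covariances for the inverses; $lambda$ and all array names are assumptions:
```python
import numpy as np

def regularized_sigma_step(X, gamma, means, prev_covs, lam=0.1):
    """X: (N, d) data, gamma: (N, k) responsibilities,
    means: (k, d), prev_covs: (k, d, d) from the previous iteration."""
    k, d = means.shape
    # scalar penalty: (lam / 2) * sum_i tr(Sigma_i^{-1})
    penalty = 0.5 * lam * sum(np.trace(np.linalg.inv(S)) for S in prev_covs)
    covs = np.zeros((k, d, d))
    for i in range(k):
        Ni = gamma[:, i].sum()
        diff = X - means[i]
        # weighted scatter / N_i, minus the penalty (broadcast to all entries)
        covs[i] = (gamma[:, i, None] * diff).T @ diff / Ni - penalty
    return covs
```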
= Problem 2e
After implementing this, the result was a lot better. I believe the
regularization term helps because it makes the $Sigma$s bigger, which speeds
up convergence.
#image("2e.png")