#set document(
  title: "Assignment 2",
author: "Michael Zhang <zhan4854@umn.edu>",
)
#let c(body) = {
set text(gray)
body
}
#let boxed(body) = {
box(stroke: 0.5pt, inset: 2pt, outset: 2pt, baseline: 0pt, body)
}
1. #c[*(50 points)* In this problem, you will implement a program to fit two multivariate Gaussian distributions to the 2-class data and classify the test data by computing the log odds $log frac(P(C_1|x), P (C_2|x))$ (equivalent to comparing discriminant functions). Three pairs of training data and test data are given. The priors $P (C_1)$ and $P (C_2)$ are estimated from the training data. The parameters $mu_1$, $mu_2$, $S_1$ and $S_2$, the mean and covariance for class 1 and class 2, are learned in the following three models for each training data and test data pair]
- #c[Model 1: Assume independent $S_1$ and $S_2$ (the discriminant function is as in equation (5.17) in the textbook).]
- #c[Model 2: Assume $S_1 = S_2$, i.e., a shared $S$ between the two classes (the discriminant function is as in equations (5.21) and (5.22) in the textbook).]
- #c[Model 3: Assume $S_1$ and $S_2$ are diagonal (the Naive Bayes model in equation (5.24)).]
a. #c[*(30 points)* Implement all three models and test your program on the three pairs of training data and test data. The main script function, `Problem1(training_data_file, test_data_file)`, is given and this script should not be modified. There are 3 scripts that need to be completed for Problem 1 (`Error_Rate.m`, `Param_Est.m`, `Classify.m`). The _TODO_: comment headers must be filled in in all 3 of these files. These _TODO_ headers describe exactly what code needs to be written to obtain full credit. The script `Error_Rate.m` calculates the error rate, `Param_Est.m` estimates the parameters of each multivariate Gaussian distribution under the 3 different models, and `Classify.m` classifies the test data using the learned models. For each test dataset, the script calls these functions and prints the training error rate and test error rate of each model to the MATLAB command window.]
```
>> AllProblem1
Dataset 1:
Model 1: (train err = 5%), (test error = 20%)
Model 2: (train err = 6%), (test error = 17%)
Model 3: (train err = 7%), (test error = 18%)
Dataset 2:
Model 1: (train err = 7%), (test error = 23%)
Model 2: (train err = 14%), (test error = 56%)
Model 3: (train err = 13%), (test error = 53%)
Dataset 3:
Model 1: (train err = 1%), (test error = 12%)
Model 2: (train err = 19%), (test error = 45%)
Model 3: (train err = 2%), (test error = 5%)
```
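For reference, the estimates behind these numbers can be sketched roughly as follows. This is only an illustration, not the graded `Param_Est.m`/`Classify.m`: it uses made-up toy data and placeholder names, where in the real scripts `X1`/`X2` would be the rows of each class from the loaded training file.
```matlab
% Rough illustration of the parameter estimates used by the three models.
rng(1);
X1 = randn(50, 2);  X2 = randn(50, 2) * 1.5 + [2 1];   % toy 2-class data
N1 = size(X1, 1);  N2 = size(X2, 1);  N = N1 + N2;

p1 = N1 / N;  p2 = N2 / N;             % priors P(C1), P(C2)
m1 = mean(X1, 1);  m2 = mean(X2, 1);   % class means

% Model 1: separate MLE covariances (cov(., 1) normalizes by N_i).
S1 = cov(X1, 1);  S2 = cov(X2, 1);

% Model 2: shared covariance, the prior-weighted combination.
S_shared = p1 * S1 + p2 * S2;

% Model 3: Naive Bayes keeps only the per-feature variances.
S1_diag = diag(diag(S1));  S2_diag = diag(diag(S2));

% Model 1 quadratic discriminants for one test point x (log odds = g1 - g2).
x = [1 0.5];
g1 = -0.5*log(det(S1)) - 0.5*(x - m1)*(S1\(x - m1)') + log(p1);
g2 = -0.5*log(det(S2)) - 0.5*(x - m2)*(S2\(x - m2)') + log(p2);
predicted_class = 1 + (g2 > g1)        % 1 if g1 >= g2, otherwise 2
```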
b. #c[*(5 points)* State which model works best on each test data set and explain why you believe this is the case. Discuss your observations.]
Interestingly, each dataset has a different best-performing model:
- For dataset 1, model 2 worked the best.
- For dataset 2, model 1 worked the best.
- For dataset 3, model 3 worked the best.
The separate covariance matrices (model 1) work better when each class has its own distinct covariance structure, while the shared matrix (model 2) works better when the two class covariances are similar, since pooling the samples gives a more stable estimate. The diagonal model (model 3) works better when there is not much data, because the off-diagonal dependencies cannot be estimated reliably from few samples and assuming independence avoids fitting them.
c. #c[*(15 points)* Write the log likelihood function and derive $S_1$ and $S_2$ by maximum likelihood estimation for model 2. Note that since $S_1$ and $S_2$ are shared as $S$, you need to add the log likelihood functions of the two classes and maximize the sum to derive $S$.]
Starting with the priors, $P(C_1) = frac(N_1, N)$ and $P(C_2) = frac(N_2, N)$, where $N_i = sum_t r_i^t$ counts the training samples of class $i$ and $N = N_1 + N_2$.
The per-class sample means are $bold(m)_1 = frac(sum_t r_1^t bold(x)^t, N_1)$ and $bold(m)_2 = frac(sum_t r_2^t bold(x)^t, N_2)$.
In model 2 the two classes share one covariance matrix $bold(S)$, so the log likelihood of the whole training set adds the contributions of both classes (with $d$ the input dimensionality):
$ cal(L)(bold(m)_1, bold(m)_2, bold(S)) = sum_t sum_(i=1)^2 r_i^t (log P(C_i) - frac(d, 2) log(2 pi) - frac(1, 2) log |bold(S)| - frac(1, 2) (bold(x)^t - bold(m)_i)^T bold(S)^(-1) (bold(x)^t - bold(m)_i)) $
Only the last two terms depend on $bold(S)$. Writing $-frac(1, 2) log |bold(S)| = frac(1, 2) log |bold(S)^(-1)|$ and using $frac(diff, diff bold(W)) log |bold(W)| = bold(W)^(-1)$ and $frac(diff, diff bold(W)) bold(a)^T bold(W) bold(a) = bold(a) bold(a)^T$ with $bold(W) = bold(S)^(-1)$, setting the derivative to zero gives:
$ frac(diff cal(L), diff bold(S)^(-1)) = sum_t sum_(i=1)^2 r_i^t (frac(1, 2) bold(S) - frac(1, 2) (bold(x)^t - bold(m)_i) (bold(x)^t - bold(m)_i)^T) = 0 $
Since $sum_t sum_(i=1)^2 r_i^t = N$, solving for $bold(S)$:
$ bold(S) &= frac(1, N) sum_t sum_(i=1)^2 r_i^t (bold(x)^t - bold(m)_i) (bold(x)^t - bold(m)_i)^T \
&= frac(N_1, N) dot frac(1, N_1) sum_t r_1^t (bold(x)^t - bold(m)_1) (bold(x)^t - bold(m)_1)^T + frac(N_2, N) dot frac(1, N_2) sum_t r_2^t (bold(x)^t - bold(m)_2) (bold(x)^t - bold(m)_2)^T \
&= P(C_1) bold(S)_1 + P(C_2) bold(S)_2 $
where $bold(S)_i = frac(1, N_i) sum_t r_i^t (bold(x)^t - bold(m)_i) (bold(x)^t - bold(m)_i)^T$ is the usual per-class maximum likelihood covariance. So the shared covariance for model 2 is the prior-weighted combination of the two class covariance estimates.
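As a quick sanity check of this result, the snippet below (toy made-up data, not part of the graded code) evaluates the shared-covariance log likelihood at the derived $bold(S)$ and at a few other positive-definite candidates, and confirms the derived one scores at least as high.
```matlab
% Toy numerical check: the prior-weighted covariance S_hat maximizes the
% shared-covariance log likelihood among the candidates tried.
rng(0);
X1 = randn(60, 3);  X2 = randn(40, 3) + 1;        % made-up 2-class data
N1 = size(X1, 1);  N2 = size(X2, 1);  N = N1 + N2;
m1 = mean(X1, 1);  m2 = mean(X2, 1);

S1 = (X1 - m1)' * (X1 - m1) / N1;                 % per-class MLE covariances
S2 = (X2 - m2)' * (X2 - m2) / N2;
S_hat = (N1/N) * S1 + (N2/N) * S2;                % derived shared covariance

% Shared-covariance log likelihood (terms independent of S dropped).
scatter_total = (X1 - m1)' * (X1 - m1) + (X2 - m2)' * (X2 - m2);
loglik = @(S) -0.5 * N * log(det(S)) - 0.5 * trace(S \ scatter_total);

assert(loglik(S_hat) >= loglik(S1)     - 1e-9);
assert(loglik(S_hat) >= loglik(S2)     - 1e-9);
assert(loglik(S_hat) >= loglik(eye(3)) - 1e-9);
```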
// Then:
// - for the sample covariance $S_1$, $s_(i i) = frac(sum_t (x_i^t - m_i)^2, P(C_1)) = frac(sum_t (x_i^t)^2 - 2x_i^t m_i + m_i^2, P(C_1))$ (with mean drawing from $m_1$)
// - for the sample covariance $S_2$, $s_(i i) = frac(sum_t (x_i^t - m_i)^2, P(C_2)) = frac(sum_t (x_i^t)^2 - 2x_i^t m_i + m_i^2, P(C_2))$ (with a different mean drawing from $m_2$)
// the overall covariance is $s_(i j) = frac(sum_t (x_i^t - m_1)(x_j^t - m_2), N)$
// deriving from $S_1$ and $S_2$ you get:
// $ s_(i j) &= frac(1,N) ((S_1)_(i j) P(C_1) + (S_2)_(i j) P(C_2)) \
// &= frac(1,N) (sum_t (x_i^t)^2 - 2x_i^t (m_1)_i + (m_1)_i^2 + (x_i^t)^2 - 2x_i^t (m_2)_i + (m_2)_i^2) \
// &= frac(1,N) (sum_t 2(x_i^t)^2 - 2x_i^t ((m_1)_i + (m_2)_i) + (m_1)_i^2 + (m_2)_i^2) \
// $
// Combining covariance matrixes:
// $ bold(S) &= frac(sum_t (bold(x)^t - bold(m)_1) (bold(x)^t - bold(m)_2)^T, N) \
// &= frac(sum_t bold(x)^t^T bold(x)^t - bold(x)^t bold(m)_2 - bold(m)_1 bold(x)^t + bold(m)_1 bold(m)_2, N) \
// $
// // The maximum likelihood of a single class can be found with:
// // $ theta^* &= "argmax"_theta cal(L) (theta|bold(X)) \
// // frac(diff, diff theta) cal(L) (theta|bold(X)) &= 0 \
// // frac(diff, diff theta) log(cal(l) (theta|bold(X))) &= 0 \
// // frac(diff, diff theta) sum_t log(p(x^t|theta)) &= 0 \
// // frac(diff, diff theta) sum_t log(frac(1,(2 pi)^(d/2) |bold(Sigma)_i|^(1/2)) exp(-frac(1,2)(bold(x)-bold(mu)_i)^T bold(Sigma)_i^(-1) (bold(x)-bold(mu)_i))) &= 0 \
// // $
2. #c[*(50 points)* In this problem, you will work on dimension reduction and classification on a Faces dataset from the UCI repository. We provided the processed files `face_train_data_960.txt` and `face_test_data_960.txt` with 500 and 124 images, respectively. Each image is of size 30 #sym.times 32, with the pixel values stored in a row of the file; the last column gives the label of the image: 1 (sunglasses) or 0 (open). You can visualize the $i$th image with the following MATLAB command:]
```matlab
imagesc(reshape(faces_data(i,1:end-1),32,30)')
```
#c[The main script function, `Problem2(training_data_file, test_data_file)`, is given and this script should not be modified. There are 5 scripts that need to be completed for Problem 2 (`Eigenfaces.m`, `ProportionOfVariance.m`, `KNN.m`, `KNN_Error.m`, `Back_Project.m`). The _TODO_: comment headers must be filled in in all 5 of these files. These _TODO_ comment headers describe exactly what code needs to be written to obtain full credit.]
a. #c[*(10 points)* Apply Principal Component Analysis (PCA) to find the principal components of the combined training and test sets. First, visualize the first 5 eigen-faces using a command line similar to the one above. This can be accomplished by completing the _TODO_ comment headers in the `Eigenfaces.m` script.]
#figure(image("images/eigenfaces.png"))
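Roughly, the computation behind this figure looks like the sketch below. It is an illustration rather than the graded `Eigenfaces.m`, and it assumes the two data files from the problem statement are on the MATLAB path.
```matlab
% Rough sketch of the eigenface computation on the combined data.
faces_data = [load('face_train_data_960.txt'); load('face_test_data_960.txt')];
X  = faces_data(:, 1:end-1);          % 624 x 960 pixel matrix (labels dropped)
mu = mean(X, 1);
Xc = X - mu;                          % center the data

[~, ~, V] = svd(Xc, 'econ');          % columns of V are the principal components

figure;
for i = 1:5
    subplot(1, 5, i);
    imagesc(reshape(V(:, i), 32, 30)');   % same reshape as the command above
    colormap gray; axis image; axis off;
    title(sprintf('Eigenface %d', i));
end
```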
b. #c[*(20 points)* Generate a plot of the proportion of variance (see Figure 6.4 (b) in the main textbook) on the training data, and select the minimum number ($K$) of eigenvectors that explain at least 90% of the variance. Show both the plot and $K$ in the report. This can be accomplished by completing the _TODO_ headers in the `ProportionOfVariance.m` script. Project the training and test data onto the $K$ principal components and run KNN on the projected data for $k = {1, 3, 5, 7}$. Print out the error rate on the test set. Implement your own K-Nearest Neighbor (KNN) classifier for this problem. Classify each test point using a majority rule, i.e., by choosing the most common class among the $k$ training points it is closest to. In the case where the two classes are equally frequent, perform a tie-break by choosing whichever class has, on average, the smaller distance to the test point. This can be accomplished by completing the _TODO_ comment headers in the `KNN.m` and `KNN_Error.m` scripts.]
#figure(image("images/prop_var.png"))
The minimum number of eigenvectors that explain at least 90% of the variance on the training data is $K = 41$, which is what I used for the projection.
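For reference, the majority vote with the average-distance tie-breaker can be sketched as below. This is an illustration, not the graded `KNN.m`: the projected data, labels, test point, and $k$ are all toy placeholders standing in for the PCA output.
```matlab
% Majority-vote KNN for one projected test point, with the tie-breaker.
rng(2);
train_proj  = [randn(30, 5); randn(30, 5) + 1];   % 60 toy projected points
train_label = [zeros(30, 1); ones(30, 1)];        % labels 0 (open) / 1 (sunglasses)
x = randn(1, 5) + 0.5;                            % one projected test point
k = 4;                                            % even k, so ties are possible

d = sqrt(sum((train_proj - x).^2, 2));            % Euclidean distances to x
[~, idx] = sort(d);                               % nearest neighbors first
nn_labels = train_label(idx(1:k));
nn_dists  = d(idx(1:k));

votes1 = sum(nn_labels == 1);
votes0 = k - votes1;
if votes1 ~= votes0
    pred = votes1 > votes0;                       % plain majority vote
else
    % Tie: pick the class with the smaller average distance to x.
    pred = mean(nn_dists(nn_labels == 1)) < mean(nn_dists(nn_labels == 0));
end
pred = double(pred)
```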
c. #c[*(20 points)* Use the first $K = {10, 50, 100}$ principal components to approximate the first five images of the training set (first five rows of the data matrix) by projecting the centered data onto the first $K$ principal components, then "back project" (weighted sum of the components) to the original space and add the mean. For each $K$, plot the reconstructed images. This can be accomplished by completing the _TODO_ comment headers in the `Back_Project.m` script. Explain your observations in the report.]
The back-projection looks like a low-resolution approximation of the original image: with a small $K$, only the directions of highest variance in the data are kept. This is why for $K = 10$ the faces are barely recognizable, just vaguely face-shaped blobs together with the shirt, while with larger $K$ the reconstruction recovers finer detail in the face and even in the background.
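Roughly, the reconstruction for each $K$ follows the sketch below. It is an illustration rather than the graded `Back_Project.m`, and it assumes `V` and `mu` from the PCA sketch above together with the training file named in the problem statement.
```matlab
% Reconstruct the first five training images with K principal components.
train_data = load('face_train_data_960.txt');
Xc = train_data(1:5, 1:end-1) - mu;              % center the first five images

for K = [10 50 100]
    W    = Xc * V(:, 1:K);                       % project onto K components
    Xrec = W * V(:, 1:K)' + mu;                  % back-project, add the mean back
    figure;
    for i = 1:5
        subplot(1, 5, i);
        imagesc(reshape(Xrec(i, :), 32, 30)');
        colormap gray; axis image; axis off;
    end
    sgtitle(sprintf('K = %d reconstruction', K));
end
```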