#set document(
title: "Assignment 1",
author: "Michael Zhang <zhan4854@umn.edu>",
)
#let c(body) = {
set text(gray)
body
}
#let boxed(body) = {
box(stroke: 0.5pt, inset: 2pt, outset: 2pt, baseline: 0pt, body)
}
1. #c[*(50 points)* In this problem, you will implement a program to fit two multivariate Gaussian distributions to the 2-class data and classify the test data by computing the log odds $log frac(P(C_1|x), P(C_2|x))$ (equivalent to comparing discriminant functions). Three pairs of training and test data are given. The priors $P(C_1)$ and $P(C_2)$ are estimated from the training data. The parameters $mu_1$, $mu_2$, $S_1$ and $S_2$ (the mean and covariance for class 1 and class 2) are learned under the following three models for each training/test data pair:]
- #c[Model 1: Assume independent $S_1$ and $S_2$ (the discriminant function is given by equation (5.17) in the textbook).]
- #c[Model 2: Assume $S_1 = S_2$, i.e., a shared $S$ between the two classes (the discriminant function is given by equations (5.21) and (5.22) in the textbook).]
- #c[Model 3: Assume $S_1$ and $S_2$ are diagonal (the naive Bayes model in equation (5.24)).]
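As a rough sketch of the three estimators (illustrative variable names, not the provided `Param_Est.m`; `X1` and `X2` are assumed to be the $N_1 times d$ and $N_2 times d$ matrices of training samples per class):

```matlab
% Class means and MLE covariances (cov(X, 1) normalizes by N, not N-1).
mu1 = mean(X1, 1);  S1 = cov(X1, 1);     % Model 1: separate covariances
mu2 = mean(X2, 1);  S2 = cov(X2, 1);

N1 = size(X1, 1);  N2 = size(X2, 1);
S_shared = (N1*S1 + N2*S2) / (N1 + N2);  % Model 2: shared (pooled) covariance

S1_diag = diag(diag(S1));                % Model 3: keep only the diagonal,
S2_diag = diag(diag(S2));                % as in the naive Bayes assumption
```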
a. #c[*(30 points)* Implement all three models and test your program on the three pairs of training and test data. The main script function, `Problem1(training_data_file, test_data_file)`, is given and should not be modified. There are 3 scripts that need to be completed for Problem 1 (`Error_Rate.m`, `Param_Est.m`, `Classify.m`). The _TODO_ comment headers must be filled in in all 3 of these files; they describe exactly what code needs to be written to obtain full credit. The script `Error_Rate.m` calculates the error rate, `Param_Est.m` estimates the parameters of each multivariate Gaussian distribution under the 3 different models, and `Classify.m` classifies the test data using the learned models. For each test dataset, the script calls several functions and prints out the training error rate and test error rate of each model to the MATLAB command window.]
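A minimal sketch of the classification rule (hypothetical names, not the graded `Classify.m`): with the discriminant $g_i (x)$ of equation (5.17), assign $x$ to class 1 exactly when the log odds $g_1 (x) - g_2 (x)$ is positive.

```matlab
% Quadratic discriminant for one sample x (1-by-d row); p_i is the estimated
% prior P(C_i). Constant terms common to both classes are dropped, and
% S \ v solves S*z = v without explicitly forming inv(S).
g = @(x, mu, S, p) -0.5*log(det(S)) - 0.5*(x - mu)*(S \ (x - mu)') + log(p);

pred = zeros(size(X_test, 1), 1);
for t = 1:size(X_test, 1)
    x = X_test(t, :);
    pred(t) = g(x, mu1, S1, p1) > g(x, mu2, S2, p2);  % 1 means class 1
end
err = mean(pred ~= y_test);   % error rate, as computed in Error_Rate.m
```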
b. #c[*(5 points)* State which model works best on each test data set and explain why you believe this is the case. Discuss your observations.]
c. #c[*(15 points)* Write the log likelihood function and derive $S_1$ and $S_2$ by maximum likelihood estimation for model 2. Note that since $S_1$ and $S_2$ are shared as a single $S$, you need to sum the log likelihood functions of the two classes and maximize that sum to derive $S$.]
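As a starting point only (this sketches the setup in the textbook's notation, not the full derivation), the quantity to maximize is the log likelihood summed over both classes with a single shared $S$:

$ cal(L)(mu_1, mu_2, S) = sum_(i=1)^2 sum_(t : x^t in C_i) [ -d/2 log(2 pi) - 1/2 log |S| - 1/2 (x^t - mu_i)^top S^(-1) (x^t - mu_i) ] $

Setting the gradient of $cal(L)$ with respect to $S$ to zero then yields the pooled covariance of equation (5.22).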
2. #c[*(50 points)* In this problem, you will work on dimension reduction and classification on a Faces dataset from the UCI repository. We provide the processed files `face_train_data_960.txt` and `face_test_data_960.txt` with 500 and 124 images, respectively. Each image is of size 30 #sym.times 32, with the pixel values stored in a row of the file; the last column holds the image's label: 1 (sunglasses) or 0 (open). You can visualize the $i$th image with the following MATLAB command:]
```matlab
imagesc(reshape(faces_data(i,1:end-1),32,30)')
```
#c[The main script function, `Problem2(training_data_file, test_data_file)`, is given and should not be modified. There are 5 scripts that need to be completed for Problem 2 (`Eigenfaces.m`, `ProportionOfVariance.m`, `KNN.m`, `KNN_Error.m`, `Back_Project.m`). The _TODO_ comment headers must be filled in in all 5 of these files; they describe exactly what code needs to be written to obtain full credit.]
a. #c[*(10 points)* Apply Principal Component Analysis (PCA) to find the principal components of the combined training and test sets. First, visualize the first 5 eigenfaces using a command similar to the one above. This can be accomplished by completing the _TODO_ comment headers in the `Eigenfaces.m` script.]
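A minimal `Eigenfaces`-style sketch, assuming the pixel columns have already been split from the label column (all names here are illustrative):

```matlab
% PCA on the combined data via the covariance matrix's eigendecomposition.
X  = [train_data(:, 1:end-1); test_data(:, 1:end-1)];
Xc = X - mean(X, 1);                     % center each pixel
[V, D] = eig(cov(Xc));                   % columns of V are eigenvectors
[~, order] = sort(diag(D), 'descend');   % sort by decreasing eigenvalue
V = V(:, order);

for i = 1:5                              % first five eigenfaces
    subplot(1, 5, i);
    imagesc(reshape(V(:, i), 32, 30)');  % same reshape as the display above
    colormap gray; axis image off;
end
```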
b. #c[*(20 points)* Generate a plot of the proportion of variance (see Figure 6.4 (b) in the main textbook) on the training data, and select the minimum number $K$ of eigenvectors that explain at least 90% of the variance. Show both the plot and $K$ in the report. This can be accomplished by completing the _TODO_ headers in the `ProportionOfVariance.m` script. Project the training and test data onto the $K$ principal components and run KNN on the projected data for $k = {1, 3, 5, 7}$. Print out the error rate on the test set. Implement your own k-nearest-neighbor (KNN) classifier for this problem. Classify each test point using a majority rule, i.e., by choosing the most common class among the $k$ training points it is closest to. In the case where the two classes are equally frequent, break the tie by choosing whichever class has, on average, a smaller distance to the test point. This can be accomplished by completing the _TODO_ comment headers in the `KNN.m` and `KNN_Error.m` scripts.]
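Two hedged sketches for this part (illustrative names, not the provided scripts): selecting $K$ from the cumulative proportion of variance, then the majority rule with the average-distance tie-breaker.

```matlab
% Proportion of variance on the training data (ProportionOfVariance-style);
% Xc_train is assumed to be the centered training pixel matrix.
lambda = sort(eig(cov(Xc_train)), 'descend');   % eigenvalues, largest first
pov = cumsum(lambda) / sum(lambda);
K = find(pov >= 0.90, 1);        % smallest K explaining >= 90% of variance

% KNN with majority vote and distance tie-breaker (KNN-style sketch).
function pred = knn_predict(X_train, y_train, X_test, k)
    pred = zeros(size(X_test, 1), 1);
    for i = 1:size(X_test, 1)
        % Euclidean distances to all training points (implicit expansion,
        % R2016b+).
        d = sqrt(sum((X_train - X_test(i, :)).^2, 2));
        [d_sorted, idx] = sort(d);
        lab = y_train(idx(1:k));  dk = d_sorted(1:k);
        if sum(lab == 1) ~= sum(lab == 0)
            pred(i) = sum(lab == 1) > sum(lab == 0);  % majority vote
        else
            % tie: pick the class with the smaller mean distance
            pred(i) = mean(dk(lab == 1)) < mean(dk(lab == 0));
        end
    end
end
```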
c. #c[*(20 points)* Use the first $K = {10, 50, 100}$ principal components to approximate the first five images of the training set (the first rows of the data matrix): project the centered data onto the first $K$ principal components, then "back project" (a weighted sum of the components) to the original space and add the mean. For each $K$, plot the reconstructed images. This can be accomplished by completing the _TODO_ comment headers in the `Back_Project.m` script. Explain your observations in the report.]
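A `Back_Project`-style sketch under the same assumptions as above (`V` sorted by decreasing eigenvalue, `X` the combined data matrix with the training rows first; names illustrative):

```matlab
mu = mean(X, 1);
for i = 1:5                              % first five training images
    z  = (X(i, :) - mu) * V(:, 1:K);     % project onto first K components
    xr = z * V(:, 1:K)' + mu;            % back-project, then add the mean
    subplot(1, 5, i);
    imagesc(reshape(xr, 32, 30)'); colormap gray; axis image off;
end
```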