== Homework 4

Michael Zhang \<zhan4854\@umn.edu\>

1. *A short description of how you went about parallelizing the k-means algorithm. You should include how you decomposed the problem and why, i.e., what were the tasks being parallelized.*

  My parallelized program included the following procedures:

  - `findDistanceToCentroid` - This computes an $N times K$ matrix of distances from each data point to each centroid.

  - `assignClosestCentroid` - This reduces each row of the distance matrix to find the nearest centroid for each data point, and stores that centroid's index in an $N times 1$ vector.

  - `recentralizeCentroidSum` - This accumulates the per-cluster coordinate sums needed for averaging, and also counts the number of elements in each cluster.

  - `recentralizeCentroidDiv` - This takes the sums and counts from the previous step and divides each sum by its cluster's count, in parallel.

  I tried to make sure each thread computes approximately a single for-loop's worth of work, most of the time over the $D$ axis.
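
  The first two procedures can be illustrated with a minimal serial C++ sketch. This is not the submitted CUDA code: it assumes row-major storage, squared Euclidean distance, and ties broken toward the lower centroid index, and the parameter names are hypothetical. Each iteration of the outer loops stands in for the work one GPU thread might do.

  ```cpp
  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Serial stand-in for the distance kernel (assumed layout: data is N x D,
  // centroids is K x D, both row-major). Returns an N x K matrix of squared
  // Euclidean distances.
  std::vector<float> findDistanceToCentroid(const std::vector<float>& data,
                                            const std::vector<float>& centroids,
                                            std::size_t n, std::size_t k,
                                            std::size_t d) {
    assert(data.size() == n * d && centroids.size() == k * d);
    std::vector<float> dist(n * k, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
      for (std::size_t c = 0; c < k; ++c) {
        // One (point, centroid) pair: a single loop over the D axis.
        float sum = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
          const float diff = data[i * d + j] - centroids[c * d + j];
          sum += diff * diff;
        }
        dist[i * k + c] = sum;
      }
    }
    return dist;
  }

  // Serial stand-in for the assignment kernel: a row-wise argmin over the
  // N x K distance matrix. Each row writes only its own output slot, so the
  // rows are independent of one another.
  std::vector<int> assignClosestCentroid(const std::vector<float>& dist,
                                         std::size_t n, std::size_t k) {
    assert(k > 0 && dist.size() == n * k);
    std::vector<int> assignment(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
      std::size_t best = 0;
      for (std::size_t c = 1; c < k; ++c) {
        if (dist[i * k + c] < dist[i * k + best]) best = c;
      }
      assignment[i] = static_cast<int>(best);
    }
    return assignment;
  }
  ```

  Comparing squared distances gives the same nearest centroid as comparing true distances, so the sketch skips the square root.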

2. *Give details about how many elements and how the computations in your kernels are handled by a thread.*

  I used dynamic thread allocation: the program chooses the number of thread blocks at runtime based on the size of the data.

  For most of the kernels, the computation is very simple: perform a row-reduction into a different array. Since all the accesses are disjoint, I don't synchronize between threads.

  However, averaging the data points unfortunately requires an $N times K times D$ operation that involves a sum reduction. I tried a tree-based approach, using some bitwise math to avoid the conditional on whether a point belongs to the same cluster, but I did not get it to work, so I kept the simpler plain approach.
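
  The plain sum-then-divide approach can be sketched serially in C++ as follows. This is an illustration rather than the submitted kernels: the `CentroidSums` struct is hypothetical, the data is assumed to be row-major, and leaving an empty cluster's sums untouched is an assumption that the writeup does not specify.

  ```cpp
  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Hypothetical container for the two outputs of the sum step.
  struct CentroidSums {
    std::vector<float> sums;  // K x D, row-major
    std::vector<int> counts;  // K
  };

  // Plain approach: every centroid scans all N points and only accumulates the
  // ones assigned to it. The membership test is the conditional that a
  // tree-based reduction would try to avoid.
  CentroidSums recentralizeCentroidSum(const std::vector<float>& data,
                                       const std::vector<int>& assignment,
                                       std::size_t n, std::size_t k,
                                       std::size_t d) {
    assert(data.size() == n * d && assignment.size() == n);
    CentroidSums out{std::vector<float>(k * d, 0.0f), std::vector<int>(k, 0)};
    for (std::size_t c = 0; c < k; ++c) {
      for (std::size_t i = 0; i < n; ++i) {
        if (static_cast<std::size_t>(assignment[i]) != c) continue;
        ++out.counts[c];
        for (std::size_t j = 0; j < d; ++j) {
          out.sums[c * d + j] += data[i * d + j];
        }
      }
    }
    return out;
  }

  // Divide each centroid's sums by its count. Every (centroid, dimension)
  // element is independent, which is why this step parallelizes trivially.
  void recentralizeCentroidDiv(CentroidSums& s, std::size_t k, std::size_t d) {
    for (std::size_t c = 0; c < k; ++c) {
      if (s.counts[c] == 0) continue;  // assumption: skip empty clusters
      for (std::size_t j = 0; j < d; ++j) {
        s.sums[c * d + j] /= static_cast<float>(s.counts[c]);
      }
    }
  }
  ```

  After the divide step, `sums` holds the new centroid coordinates for every non-empty cluster.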

3. *Ensure you include details about the thread hierarchy, i.e., whether the threads are organized in a 1D, 2D, or, 3D fashion in a thread-block, and whether the thread-blocks are arranged 1D, 2D, or, 3D grid. NOTE: If you choose to write CUDA kernels where the number of thread blocks is determined dynamically by the program during runtime, then send -1 as the input argument for the number of thread blocks to the invocation. In your program, use -1 as a flag to indicate that the number of thread blocks will need to be computed during runtime.*

  I used a 1D thread hierarchy. This is because all my accesses are already basically along the "good" axis, so I'm not doing any strides along other dimensions.
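
  As a rough illustration of a 1D launch configuration with the -1 flag, the helper below resolves the block count by ceiling division and computes a flat global index. Both function names are hypothetical, and one element per thread is an assumption, not something the writeup states.

  ```cpp
  #include <cassert>

  // Hypothetical helper: if the caller passes -1, choose enough 1D blocks to
  // cover every element with one thread each; otherwise honor the request.
  int resolveNumBlocks(int requestedBlocks, int threadsPerBlock, int numElements) {
    assert(threadsPerBlock > 0 && numElements >= 0);
    if (requestedBlocks != -1) return requestedBlocks;
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
  }

  // Flat index of a thread in a 1D grid of 1D blocks, mirroring the usual
  // blockIdx.x * blockDim.x + threadIdx.x expression in a CUDA kernel.
  int globalIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
  }
  ```

  Because the block count is rounded up, the last block can contain threads whose index is past the end of the data, so a kernel using this scheme would need a bounds check before touching memory.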

4. *You need to perform a parameter study in order to determine how the number of elements processed by a thread and the size of a thread-block, i.e., the \# threads in a block, affect the performance of your algorithm. Your writeup should contain some results showing the runtime that you obtained for different choices.*

5. *You should include results on the 'large_cpd.txt' dataset with 256, 512, and 1024 clusters.*

  - 256: 26.8258s
  - 512: 62.1212s
  - 1024: 163.4022s