#set page(margin: 1.5em)
== Homework 4
Michael Zhang \<zhan4854\@umn.edu\>
1. *A short description of how you went about parallelizing the k-means algorithm. You should include how you decomposed the problem and why, i.e., what were the tasks being parallelized.*
My parallelized program included the following procedures:
- `findDistanceToCentroid` - This computes an $N times K$ matrix of distances from each data point to each centroid.
- `assignClosestCentroid` - This reduces the distance matrix to find, for each data point, the centroid at minimum distance, and writes that assignment into an $N times 1$ vector.
- `recentralizeCentroidSum` - This sums the data points assigned to each cluster (in preparation for averaging) and counts the number of points in each cluster.
- `recentralizeCentroidDiv` - This divides each centroid's sum by the count from the previous step, in parallel, to produce the new centroids.
I tried to make sure each thread computes approximately one for-loop's worth of data, most of the time along the $D$ axis. A sketch of the distance kernel is shown below.
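For concreteness, here is a minimal sketch of the distance kernel, assuming row-major layouts (`data` is $N times D$, `centroids` is $K times D$, `dist` is $N times K$) and one thread per (point, centroid) pair; the exact signature and indexing in my submission may differ slightly.

```c
__global__ void findDistanceToCentroid(float *dist, const float *data,
                                       const float *centroids,
                                       int N, int K, int D) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N * K) return;
  int n = i / K, k = i % K;

  // Each thread loops over the D axis, accumulating a squared distance.
  float sum = 0.0f;
  for (int d = 0; d < D; d++) {
    float diff = data[n * D + d] - centroids[k * D + d];
    sum += diff * diff;
  }
  dist[n * K + k] = sum;
}
```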
2. *Give details about how many elements and how the computations in your kernels are handled by a thread.*
I size the grid dynamically based on the size of the data, so the number of thread blocks is computed at runtime from the element count.
For most of the kernels, the computation is very simple: each thread performs a row reduction into a separate output array. Since all of the writes are disjoint, I do not synchronize between threads. The launch configuration follows the sketch below.
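A minimal sketch of the dynamic launch configuration, assuming a fixed block size (the constant 256 here is illustrative, not necessarily what my code uses):

```c
// Ceiling division: enough blocks so that every element gets a thread.
#define THREADS_PER_BLOCK 256

static inline int numBlocks(int numElements) {
  return (numElements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

// Example launch covering all N*K entries of the distance matrix:
// findDistanceToCentroid<<<numBlocks(N * K), THREADS_PER_BLOCK>>>(
//     dist, data, centroids, N, K, D);
```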
However, for averaging the data points, I unfortunately need an $N times K times D$ operation that involves a sum reduction. I tried a tree-based reduction, using some bitwise arithmetic to avoid the conditional that checks whether a point belongs to the cluster, but I could not get it to work correctly, so I kept the plain approach, which is simpler. A sketch of the plain version follows.
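This sketch assumes one thread per (cluster, dimension) pair, with each thread doing the conditional sum over all $N$ points; the thread mapping here is an assumption, not necessarily the exact one in my code.

```c
__global__ void recentralizeCentroidSum(float *centroidSums, int *counts,
                                        const float *data,
                                        const int *assignment,
                                        int N, int K, int D) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= K * D) return;
  int k = i / D, d = i % D;

  // Sequential sum over all points, keeping only those assigned to
  // cluster k; this is the conditional the tree-based version tried to avoid.
  float sum = 0.0f;
  int count = 0;
  for (int n = 0; n < N; n++) {
    if (assignment[n] == k) {
      sum += data[n * D + d];
      count++;
    }
  }
  centroidSums[k * D + d] = sum;
  if (d == 0) counts[k] = count; // count each point once per cluster
}
```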
3. *Ensure you include details about the thread hierarchy, i.e., whether the threads are organized in a 1D, 2D, or, 3D fashion in a thread-block, and whether the thread-blocks are arranged 1D, 2D, or, 3D grid. NOTE: If you choose to write CUDA kernels where the number of thread blocks is determined dynamically by the program during runtime, then send -1 as the input argument for the number of thread blocks to the invocation. In your program, use -1 as a flag to indicate that the number of thread blocks will need to be computed during runtime.*
I used a 1D arrangement of threads within each block, and a 1D grid of thread blocks. This is because my accesses already run along a single contiguous axis, so I gain nothing from striding over additional dimensions. The -1 convention for the block count is handled as sketched below.
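A sketch of how the -1 flag from the assignment can be resolved at invocation time; the function name is illustrative:

```c
// If the requested thread-block count is -1, derive the 1D grid size at
// runtime from the element count; otherwise use the value as given.
int resolveNumBlocks(int requestedBlocks, int numElements, int threadsPerBlock) {
  if (requestedBlocks == -1)
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
  return requestedBlocks;
}
```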
4. *You need to perform a parameter study in order to determine how the number of elements processed by a thread and the size of a thread-block, i.e., the \# threads in a block, affect the performance of your algorithm. Your writeup should contain some results showing the runtime that you obtained for different choices.*
I tried using a fixed number of thread blocks, but for the 256-cluster case this ran almost twice as slowly (at 0.4516s). To the best of my knowledge, when more thread blocks are requested than the hardware can run concurrently, CUDA schedules the excess blocks sequentially after earlier ones complete. Each configuration in the study was timed as sketched below.
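A sketch of timing one configuration with CUDA events; `runKMeans` is a hypothetical wrapper around the four kernels above, and the outer loop over block sizes is implied.

```c
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
runKMeans(data, centroids, N, K, D, threadsPerBlock); // hypothetical wrapper
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
printf("block size %d: %.4fs\n", threadsPerBlock, ms / 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```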
5. *You should include results on the 'large_cpd.txt' dataset with 256, 512, and 1024 clusters.*
- 256 clusters: 26.8258s
- 512 clusters: 62.1212s
- 1024 clusters: 163.4022s