csci5451/assignments/01/ASSIGNMENT.md
2023-09-25 09:32:31 -05:00

6.9 KiB

Introduction

The purpose of this assignment is to become familiar with Pthreads and OpenMP by parallelizing a common algorithm in data mining: Linear classification with squared loss.

Linear classification

The algorithm is described here.

You need to write two programs, called lc_pthreads and lc_openmp. Each program will accept four parameters as input: a filename of data points, a filename of data labels, number of outer iterations, and number of threads. Your programs should read in the data, run the classification algorithm, print your classification model loss to the standard output after each outer iteration (explained in the algorithm), and finally print the time taken by your program to the standard output at the end.

Dataset

The dataset you'll use in this project is derived from the MNIST dataset. Since you'll perform binary classification in this project, the dataset here corresponds to pixel values of 28X28 images of handwritten digits corresponding to digits '1' and '7'. In the label file provided, the label corresponding to digit '1' is 1, while for '7', it is -1.

Input/Output Formats

Each program will take as input two files (the list of data points, and the list of labels).

The input file with data points contains N+1 lines. The first line contains two space-separated integers: the number of data points (N), and the dimensionality of each data point (D). The following N lines each contain D space-separated integerss which represent the coordinates of the current data point. For example, an input with four two-dimensional data points would be stored in a file as:

4 2
1 1
1 2
3 4
3 2

The input file with class labels contains N+1 lines. The first line contains the number of data points (N). The following N lines each contains one integer each, the class label of each data-point. For example, an input with four data points would be stored in a file as:

4
1
-1
1
-1

Your program must also print the execution time to standard output. You should use a high-precision, monotonic, wall-clock timer and also omit the time spent reading the files. We recommend the function clock_gettime() when on a Linux system. You will be graded on a Linux system. Here is a function that you may use for timing:

/* Gives us high-resolution timers. */
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/**
* @brief Return the number of seconds since an unspecified time (e.g., Unix
*        epoch). This is accomplished with a high-resolution monotonic timer,
*        suitable for performance timing.
*
* @return The number of seconds.
*/
static inline double monotonic_seconds()
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

You should use the following function to output your execution time:

/**
* @brief Output the seconds elapsed while execution.
*
* @param seconds Seconds spent on execution, excluding IO.
*/
static void print_time(double const seconds)
{
  printf("Execution time: %0.04fs\n", seconds);
}

Failure to follow any of these output instructions will result in significant loss of points.

Testing

Data files are also provided on phi0{1..8}.cselabs.umn.edu at /export/scratch/CSCI5451_F23/assignment-1/dataset. The data file is named MNIST_data.csv and the label file is named MNIST_label.csv.

You should only run your code on one of the phi machines. Use the last digit of your student ID to select the machine. For example, if your student ID is "1234567", use phi07. If your ID ends with 0 or 9, use the second-to-last digit.

We also provide one small test case (two-dimensional dataset) you can use to verify the correctness of your code: The corresponding files are small_data.csv and small_label.csv. Coordinate descent over this dataset should be able to converge in less than 10 iterations and the resultant w vector should be something like [0.394713 0.550215].

Remember that the TA will be evaluating your data with a different data sets than those provided for testing.

What you need to turn in

  1. The source code of your programs.
  2. A short report including the following two parts:
    1. A short description of how you went about parallelizing the classification algorithm. You should include how you decomposed the problem and why, i.e., what were the tasks being parallelized.
    2. Timing results for 1, 2, 4, 8, and 16 threads for the classification. You should include results with outer iterations set to 10.

Do NOT include the test files; TA will have their own test files for grading. You will lose significant points for including test files.

Additional specifications related to assignment submission

  • A makefile must be provided to compile and generate the two executables. The executables should be named:

    • lc_pthreads
    • lc_openmp
  • Program invocation: Your programs should take as an argument the input file containing data points, file containing labels, the number of outer iterations, and the number of threads to be used for parallel execution. For example, with ten outer iterations and two threads, each of the programs would be invoked as follows:

    ./lc_pthreads /export/scratch/CSCI5451-F23/assignment-1/data.csv /export/scratch/CSCI5451-F23/assignment-1/data.label 10 2
    ./lc_openmp /export/scratch/CSCI5451-F23/assignment-1/data.csv /export/scratch/CSCI5451-F23/assignment-1/data.label 10 2
    
  • All files (code + report) MUST be in a single directory and the directory's name MUST be your UMN login ID (e.g., your UMN email or Moodle username). Your submission directory MUST include at least the following files (other auxiliary files may also be included):

    <UMN ID>/lc_pthreads.c
    <UMN ID>/lc_openmp.c
    <UMN ID>/Makefile
    <UMN ID>/report.pdf
    

If you choose to code in C++, then replace the .c suffixes with .cpp or .cc.

  • Submission MUST be in .tar.gz

  • The following sequence of commands should work on your submission file:

    tar xzvf <UMN ID>.tar.gz
    cd <UMN ID>
    make
    ls -ld lc_pthreads
    ls -ld lc_openmp
    

This ensures that your submission is packaged correctly, your directory is named correctly, your makefile works correctly, and your output executables are named correctly. If any of these does not work, modify it so that you do not lose points. The TAs can answer questions about correctly formatting your submission BEFORE the assignment is due. Do not expect them to answer questions the night it is due.

Failure to follow any of these submission instructions will result in significant loss of points.

Evaluation criteria

The goal for this assignment is for you to become familiar with the APIs and not so much for developing the most efficient parallel program (this will be done later). As such, full points will be given to the programs that:

  1. follows the assignment directions;
  2. solve the problem correctly;
  3. do so in parallel (i.e., sub-steps that can be parallelized are parallelized); and
  4. achieves good speed-up.