== Steps 2-4

For steps 2-4, I computed each process's outgoing nodes, sorted them, and used
each node's position in the sorted list to identify which nodes are being sent.

This saves an extra communication and lets me index the same items on each
loop iteration.
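
A minimal sketch of the sorted-position idea, not the code from my submission.
The helper names `build_sorted_outgoing` and `sorted_position` are hypothetical,
node ids are assumed to be plain `int`s, and dropping duplicates is an
assumption made here so that each node gets exactly one position.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sort the outgoing node ids and drop duplicates (an assumption of this
// sketch) so that every node has one stable position in the list.
std::vector<int> build_sorted_outgoing(std::vector<int> nodes) {
    std::sort(nodes.begin(), nodes.end());
    nodes.erase(std::unique(nodes.begin(), nodes.end()), nodes.end());
    return nodes;
}

// Return the position of `node` in the sorted list, or -1 if it is absent.
std::ptrdiff_t sorted_position(const std::vector<int>& sorted, int node) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), node);
    if (it == sorted.end() || *it != node) return -1;
    return it - sorted.begin();
}
```

Because the list is built once and stays sorted, the same position refers to the
same node on every later iteration.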

== Step 5

I exchanged data using the unstructured communication approach, doing an
all-to-all transfer.
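
As an illustration of the bookkeeping an all-to-all exchange usually needs, here
is a sketch that computes per-rank send counts and buffer offsets. It assumes
the exchange is done with something like MPI's `MPI_Alltoallv`, which this
write-up does not state. The names `send_counts`, `displacements` and
`dest_rank` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// counts[r] is how many outgoing items go to rank r.
// dest_rank[i] is the (hypothetical) destination rank of outgoing item i.
std::vector<int> send_counts(const std::vector<int>& dest_rank, int num_ranks) {
    std::vector<int> counts(num_ranks, 0);
    for (int r : dest_rank) ++counts[r];
    return counts;
}

// Exclusive prefix sum: displs[r] is the offset of rank r's slice in a
// send buffer that is packed rank by rank.
std::vector<int> displacements(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    for (std::size_t r = 1; r < counts.size(); ++r)
        displs[r] = displs[r - 1] + counts[r - 1];
    return displs;
}
```

These two arrays have the shape that `MPI_Alltoallv` expects for its
`sendcounts` and `sdispls` arguments. The receive side needs matching counts and
offsets, which are not shown here.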

To read the result efficiently, I first tried the approach given in the slides.
I also tried binary search, since it gives $O(log n)$ lookups. However, this
took a long time (up to 45 seconds for the 10,000 case) and was the bottleneck.
Switching to the STL's `std::map` proved to be orders of magnitude faster.
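
A rough sketch of a `std::map`-based lookup, assuming the received data arrives
as (node id, value) pairs of `int`s. The pair layout and the helper names
`index_received` and `lookup` are assumptions for illustration, not the actual
code.

```cpp
#include <map>
#include <utility>
#include <vector>

// Build a lookup table from received (node id, value) pairs.
// If a node id appears more than once, the last value wins.
std::map<int, int> index_received(
    const std::vector<std::pair<int, int>>& received) {
    std::map<int, int> table;
    for (const auto& kv : received) table[kv.first] = kv.second;
    return table;
}

// Look up a node; returns `fallback` if the node was not received.
int lookup(const std::map<int, int>& table, int node, int fallback) {
    auto it = table.find(node);
    return it == table.end() ? fallback : it->second;
}
```

Note that `std::map::find` is also logarithmic in the container size, so the gap
I measured most likely comes from implementation details of my binary search
rather than from asymptotic complexity.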

== Other remarks

On the original example dataset, the program performs poorly with larger numbers
of processes. I have an explanation for this after looking at the performance
characteristics of the run: it completes in one iteration, in which every single
edge is assigned. The data distribution also indicates that almost everything is
connected to the first node, so the work isn't balanced.

I've written a generation script in Python using the `igraph` library. For each
dataset size, the generated graph has the following number of components:

- 1,000: 93 components
- 10,000: 947 components
- 100,000: 9,423 components
- 1,000,000: 92,880 components

Using this data, I was able to achieve much better speedup. I didn't attach the
actual data files but they can be generated with the same script (seeded for
reproducibility).

*NOTE:* I noticed afterwards that the example data was changed again, this time
to a more balanced graph, so the numbers will not reflect the poorer
performance.

== Timing on example dataset

This experiment was run on CSELabs using my bench script, and the table was
generated with another script.

#table(
  columns: (auto, auto, auto, auto, auto, auto),
  [],  [1], [2], [4], [8], [16] ,
  [1000],
  [0.0249s #linebreak() 0.0151s],
  [0.0234s #linebreak() 0.0122s],
  [0.0206s #linebreak() 0.0099s],
  [0.0491s #linebreak() 0.0248s],
  [0.0177s #linebreak() 0.0106s],

  [10000],
  [0.2929s #linebreak() 0.1830s],
  [0.2933s #linebreak() 0.1540s],
  [0.2457s #linebreak() 0.1178s],
  [0.3793s #linebreak() 0.1328s],
  [0.2473s #linebreak() 0.1197s],

  [100000],
  [3.7888s #linebreak() 2.4881s],
  [3.7592s #linebreak() 2.0212s],
  [3.3819s #linebreak() 1.6036s],
  [2.9485s #linebreak() 1.3954s],
  [2.8593s #linebreak() 1.3107s],

  [1000000],
  [46.7895s #linebreak() 31.9648s],
  [45.2284s #linebreak() 24.8540s],
  [40.3994s #linebreak() 20.2851s],
  [36.9628s #linebreak() 17.6794s],
  [35.7110s #linebreak() 16.6276s],

)
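
To turn a row of the table into speedup figures, a small helper like the one
below could be used. This is only a sketch under two assumptions that the table
itself does not spell out: that the columns are process counts, and that the
values passed in are run times from the same position in each cell. The function
name `speedup` is hypothetical.

```cpp
#include <vector>

// Speedup relative to the first entry: s[i] = times[0] / times[i].
// Assumes `times` is non-empty and all entries are positive.
std::vector<double> speedup(const std::vector<double>& times) {
    std::vector<double> s;
    s.reserve(times.size());
    for (double t : times) s.push_back(times.front() / t);
    return s;
}
```

For example, passing the first value of each cell in the 1000000 row,
`{46.7895, 45.2284, 40.3994, 36.9628, 35.7110}`, gives the speedup for each
column relative to the first one.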