This version of the code implements, via conditional compilation, multiple versions of the k-means algorithm to explore the vector opportunities each flavour may offer.
To build and execute the progam, you will need a c++ compiler and a python
interpreter. By default the Makefile is configures to execute clang++, but you
can easily change the compiler by properly setting the Makefile’s variable
CXX.
There are multiple versions implemented showing different behavior that can be controlled through preprocessor variables at compile time:
Extended Data Structure: The 2D point structure can be extended by an assignment variable indicating the centroid id to which the data point has been assigned. If not enabled, a separate array is used for assignments.
Assignment Fission: This option separates the cluster assignment step from the minimum distance loop. This approach enhances vectorization.
Nested Centroid Update: The first version updates all centroids simultaneously by iterating over all data points once and calculating the mean for all centroids at the same time. The second version iterates over all data points once for each centroid, enabling better vectorization.
Convergence Check: The convergence check can be switched off to force maximum iteration count.
Once the building options are properly set, run:
$ make
That command will generate the kmens.exe program.
$python input/gen_input.py --file input/test.in --k 5 --n 1000 --dim_x 100 --dim_y 100
Generated 1000 points around 5 centroids. Written to input/test.in.
In order to run the program, you can directly execute the command line manually providing the required paramenters:
OMP_NUM_THREADS=1 ./kmeans.exe input/small.in <DIM> <NPOINTS> <NCENTERS> <NITERS>
For example:
OMP_NUM_THREADS=1 ./kmeans.exe input/small.in 100 1000 5 20
Or you can rely on the pre-configured small, mid, and large workloads, example:
$ make run-small