Benchmarking Distributed GEMM Algorithms

Laurent Fainsin

Introduction

In this report, we present the results of a benchmark study on three variations of the distributed matrix-matrix multiplication (GEMM) algorithm: blocking peer-to-peer communication, blocking collective communication, and non-blocking peer-to-peer communication. The GEMM operation is a fundamental operation in high performance scientific computing, and these three algorithms represent different approaches to implementing it in a distributed environment. The goal of the study was to assess the scalability of the algorithms and the impact of their different approaches on performance. The results of the study will provide insights into the strengths and limitations of the different algorithms and guide the development of future high performance scientific computing applications. The VITE visualizer will not be used in this report, as its usability is not good.

Methodology

We implemented and benchmarked three variations of the distributed matrix-matrix multiplication (GEMM) algorithm in C using the OpenMPI library. The algorithms were compiled with the smpicc compiler and executed on truite.enseeiht.fr, a machine with two Intel Xeon E5-2620 (totaling 32 cores). For each combination of the number of nodes (p and q), the problem dimensions (m, n and k) and “hyperparameters” such as the look-ahead la and the square block dimension b, we recorded the execution speeds and plotted them using matplotlib in python. All runs were repeated 5 times (--niter) to denoise the plots.

Results

Scalability according to m, n and k

We decide that m = n = k (for simplicity). We also set p = 10, q = 10, la = 10 and np = 100.

Here are our results for b = 100:

mnk_time_b100.png mnk_gflops_b100.png

And for b = 1000:

mnk_time_b1000.png mnk_gflops_b1000.png

From these plots we conclude that the Non-blocking peer-to-peer communications algorithm (p2p-i-la) scales way better than the two others algorithms. The Blocking peer-to-peer communication (p2p) and Blocking collective communication (bcast) have similar performances, although bcast seems slightly better. We also do not observe a decreasing yield with respect to Gflops, these algorithms seem fully scalable from the resources (truite.enseeiht.fr) we have at our disposal.

Scalability according to p, q and np

We decide that q = 1 and np = q (for simplicity). We also set la = 10, b = 100, n = 1000, m = 1000 and k = 1000.

pq_time.png pq_gflops.png

From these plots we conclude that low and high numbers of nodes is not beneficial for performance. This seems natural and we deduce that optimal performance is achieved when p is equal to our number of cpu cores (32 in our case as we only have one machine).

Impact of look-ahead la

In the Non-blocking peer-to-peer communications algorithm (p2p-i-la) we define a lookahead value la, which is the number of blocks to overlap communication and computation. This can help our algorithm to be more efficient by caching our next operations.

la_time.png la_gflops.png

From these figures, we conclude that it is not useful to select a lookahead value exceeding 10. The performance gain seems to stagnate after this critical value of 10 (even at very high lookahead values such as 10000).

Impact of block dimension b

We set la = 10, np = 16, p = 4, q = 4, n = 5000, m = 5000 and k = 5000.

b_time.png b_gflops.png

From these figures, we guess that low and high values of b seem to hurt performance. We deduce that well chosen values of b (approximately such that b = n/np) allow for optimal performances.

Conclusion

In conclusion, this report presents the results of a benchmark study on three variations of the distributed matrix-matrix multiplication (GEMM) algorithm. The results of the study provide insights into the strengths and limitations of the different algorithms and guide the development of future high-performance scientific computing applications. It was observed that the Non-blocking peer-to-peer communications algorithm (p2p-i-la) scales better than the other two algorithms, blocking peer-to-peer communication (p2p) and blocking collective communication (bcast). It was also observed that the optimal performance is achieved when the number of nodes (p) is equal to the number of CPU cores. Additionally, the study also concludes that it is not useful to select a lookahead value exceeding 10 in the Non-blocking peer-to-peer communications algorithm and the performance gain seems to stagnate after this critical value. The parameter of block dimension b also has impact on the algorithm performance.