In this report, we present the results of a benchmark study on three variations of the distributed matrix-matrix multiplication (GEMM) algorithm: blocking peer-to-peer communication, blocking collective communication, and non-blocking peer-to-peer communication. GEMM is a fundamental operation in high-performance scientific computing, and these three algorithms represent different approaches to implementing it in a distributed environment. The goal of the study was to assess the scalability of the algorithms and the impact of their different approaches on performance. The results provide insights into the strengths and limitations of each algorithm and can guide the development of future high-performance scientific computing applications. The VITE visualizer is not used in this report, as we found its usability poor.
We implemented and benchmarked three variations of the distributed matrix-matrix multiplication (GEMM) algorithm in C using the OpenMPI library. The algorithms were compiled with the smpicc compiler and executed on truite.enseeiht.fr, a machine with two Intel Xeon E5-2620 processors (totaling 32 cores). For each combination of the process grid dimensions (p and q), the problem dimensions (m, n and k), and "hyperparameters" such as the look-ahead la and the square block dimension b, we recorded the execution speeds and plotted them using matplotlib in Python. All runs were repeated 5 times (--niter) to reduce noise in the plots.
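As an illustration of the measurement methodology, below is a minimal sketch of what such a timing loop might look like. The helper distributed_gemm and the reporting of the best Gflop/s rate are our illustrative assumptions, not the report's actual harness.

    #include <mpi.h>

    /* Sketch of a timing loop (our assumption, not the exact harness):
     * each configuration is run `niter` times and the best Gflop/s rate
     * is kept. distributed_gemm stands for any of the three variants. */
    extern void distributed_gemm(int m, int n, int k, int b, int la);

    double bench_gemm(int m, int n, int k, int b, int la, int niter)
    {
        double best = 0.0;
        for (int it = 0; it < niter; ++it) {
            MPI_Barrier(MPI_COMM_WORLD);      /* align all ranks before timing */
            double t0 = MPI_Wtime();
            distributed_gemm(m, n, k, b, la);
            MPI_Barrier(MPI_COMM_WORLD);      /* wait for the slowest rank */
            double gflops = 2.0 * m * n * k / (MPI_Wtime() - t0) * 1e-9;  /* 2mnk flops */
            if (gflops > best)
                best = gflops;
        }
        return best;
    }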
m, n and k
For simplicity, we decide that m = n = k. We also set p = 10, q = 10, la = 10 and np = 100.
Here are our results for b = 100:

And for b = 1000:
From these plots we conclude that the non-blocking peer-to-peer communications algorithm (p2p-i-la) scales far better than the other two algorithms. Blocking peer-to-peer communication (p2p) and blocking collective communication (bcast) have similar performance, although bcast seems slightly better. We also do not observe diminishing returns in Gflop/s: these algorithms appear fully scalable on the resources (truite.enseeiht.fr) at our disposal.
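For reference, here is a minimal sketch of what a bcast-style inner loop can look like: a SUMMA-like scheme on a p x q process grid where, at each step, the owners of the current panels of A and B broadcast them along their process row and column. The function name, the row/column communicators and the naive local update are our illustrative assumptions, not the report's exact code.

    #include <mpi.h>

    /* Illustrative SUMMA-style loop for the bcast variant (an assumed
     * structure, not the report's exact code). At step s, panel s of A
     * is broadcast along process rows and panel s of B along process
     * columns, then every rank accumulates C_local += A_panel * B_panel. */
    void gemm_bcast(int mloc, int nloc, int k, int b, int p, int q,
                    MPI_Comm row_comm, MPI_Comm col_comm,
                    double *A_panel, double *B_panel, double *C_local)
    {
        for (int s = 0; s < k / b; ++s) {
            int a_root = s % q;   /* grid column owning panel s of A */
            int b_root = s % p;   /* grid row owning panel s of B */
            /* Owners are assumed to have filled their panel buffers. */
            MPI_Bcast(A_panel, mloc * b, MPI_DOUBLE, a_root, row_comm);
            MPI_Bcast(B_panel, b * nloc, MPI_DOUBLE, b_root, col_comm);
            /* Naive local rank-b update: C_local += A_panel * B_panel. */
            for (int i = 0; i < mloc; ++i)
                for (int j = 0; j < nloc; ++j)
                    for (int l = 0; l < b; ++l)
                        C_local[i * nloc + j] +=
                            A_panel[i * b + l] * B_panel[l * nloc + j];
        }
    }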
p, q and np
For simplicity, we decide that q = 1 and np = p (since q = 1, the total number of processes equals p). We also set la = 10, b = 100, n = 1000, m = 1000 and k = 1000.
From these plots we conclude that neither very low nor very high numbers of nodes are beneficial for performance. This seems natural, and we deduce that optimal performance is achieved when p is equal to our number of CPU cores (32 in our case, as we only have one machine).
la
In the non-blocking peer-to-peer communications algorithm (p2p-i-la) we define a lookahead value la, which is the number of blocks over which communication and computation are overlapped. This can make the algorithm more efficient by prefetching the data needed by upcoming operations, as sketched below.
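To make the mechanism concrete, here is a minimal sketch of a receive pipeline with lookahead; the buffer ring, function names and tag scheme are our illustrative assumptions, not the report's exact code.

    #include <mpi.h>

    #define LA_MAX 64  /* upper bound on the lookahead for this sketch */

    /* Illustrative lookahead mechanism (an assumed structure, not the
     * report's exact code): up to `la` panel receives are kept in
     * flight with MPI_Irecv, so the network delivers future panels
     * while the CPU works on the current one. Assumes la <= LA_MAX. */
    void pipeline_with_lookahead(double *buf[LA_MAX], int nsteps, int la,
                                 int panel_len, int src, MPI_Comm comm,
                                 void (*compute)(const double *, int))
    {
        MPI_Request req[LA_MAX];
        int posted = 0;

        /* Prime the pipeline: post the first `la` receives in advance. */
        for (; posted < la && posted < nsteps; ++posted)
            MPI_Irecv(buf[posted % la], panel_len, MPI_DOUBLE, src,
                      posted, comm, &req[posted % la]);

        for (int s = 0; s < nsteps; ++s) {
            MPI_Wait(&req[s % la], MPI_STATUS_IGNORE);  /* panel s is ready */
            compute(buf[s % la], panel_len);  /* overlaps with in-flight receives */
            if (posted < nsteps) {            /* reuse the slot just consumed */
                MPI_Irecv(buf[posted % la], panel_len, MPI_DOUBLE, src,
                          posted, comm, &req[posted % la]);
                ++posted;
            }
        }
    }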
From these figures, we conclude that it is not useful to select a lookahead value exceeding 10: the performance gain stagnates beyond this critical value (even at very high lookahead values such as 10000).
b
We set la = 10, np = 16, p = 4, q = 4, n = 5000, m = 5000 and k = 5000.
From these figures, we observe that both low and high values of b hurt performance. We deduce that well-chosen values of b (approximately b = n/np, i.e. b ≈ 5000/16 ≈ 312 here) allow for optimal performance.
In conclusion, this report presents the results of a benchmark study on three variations of the distributed matrix-matrix multiplication (GEMM) algorithm. The results provide insights into the strengths and limitations of the different algorithms and can guide the development of future high-performance scientific computing applications. We observed that the non-blocking peer-to-peer communications algorithm (p2p-i-la) scales better than the other two algorithms, blocking peer-to-peer communication (p2p) and blocking collective communication (bcast). We also observed that optimal performance is achieved when the number of nodes (p) equals the number of CPU cores. Additionally, the study concludes that it is not useful to select a lookahead value exceeding 10 in the non-blocking peer-to-peer communications algorithm, as the performance gain stagnates beyond this value. Finally, the block dimension b also has an impact on performance, with the best results obtained around b = n/np.