\documentclass{article}

\usepackage[margin=1in]{geometry}
\usepackage{parskip}
\usepackage{float}

\usepackage{xcolor}
\usepackage{graphicx}
\graphicspath{{./figures/}}

\usepackage{hyperref}
\hypersetup{colorlinks=true,
            linkcolor=black,
            urlcolor=blue,
            citecolor=black,
            pdftitle={High-Performance Square Matrix-Matrix Multiplication},
            pdfpagemode=FullScreen}

\usepackage{algpseudocode}

\title{High-Performance Square Matrix-Matrix Multiplication}
\author{Connor Moore, 100826701}

\begin{document}
\maketitle


\section{Introduction}
Computers have become indispensable to the scientific community. Although modern hardware offers enormous computing power, there is continuing merit in writing software as efficiently as possible. \emph{High-performance computing} is computing with a focus on extreme performance \cite{Robey2021}, and it is the discipline that enables large-scale simulation efforts in various areas of science and engineering \cite{Curcic2020}.

Matrices appear throughout computing. Because of their widespread use, it is beneficial to implement matrix operations as efficiently as possible, especially for very large systems. This report surveys several ways of computing a square matrix-matrix product, with the goal of minimizing wall time. The impact of solution technique, compiler, compiler flags, and matrix size is investigated. The different solutions were implemented in Fortran and timed programmatically.

\section{Matrix-Matrix Multiplication Implementations}
When considering two $n\times n$ matrices $a$ and $b$, their product $c$ can be calculated using:
\begin{equation}
    c_{ij}=\sum_{k=1}^{n}a_{ik}b_{kj}
    \label{eq:matproduct}
\end{equation}
where $i$ and $j$ index the rows and columns of each matrix and $k$ is the summation index \cite{Watkins2010}.

\subsection{Triple-loop Techniques}
It is trivial to implement (\ref{eq:matproduct}) in most programming languages. The general algorithm for computing the product is:
\begin{algorithmic}
\State $c \gets 0$
\For{$i=1,n$}
    \For{$j=1,n$}
        \For{$k=1,n$}
            \State $c(i,j)=c(i,j)+a(i,k)\times b(k,j)$
        \EndFor
    \EndFor
\EndFor
\end{algorithmic}
where the subscript notation has been replaced by Fortran-like array indexing. This algorithm performs $2n^3$ floating-point operations (FLOPs), giving a complexity of $\mathcal{O}(n^3)$ \cite{Watkins2010}.
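
For reference, a minimal Fortran sketch of this loop ordering is given below. It is a direct transcription of the pseudocode rather than the exact routine used in this project (which also includes a variant with the loops reordered so that the innermost index runs down columns, to match Fortran's column-major storage).
\begin{verbatim}
! Sketch: naive i-j-k triple loop for n x n double-precision matrices.
subroutine triple_loop(n, a, b, c)
    implicit none
    integer, intent(in)  :: n
    real(8), intent(in)  :: a(n,n), b(n,n)
    real(8), intent(out) :: c(n,n)
    integer :: i, j, k

    c = 0.0d0
    do i = 1, n
        do j = 1, n
            do k = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
        end do
    end do
end subroutine triple_loop
\end{verbatim}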
\subsection{Fortran's \texttt{matmul} Function}
Fortran, like many other languages, provides a built-in (intrinsic) function for matrix-matrix multiplication. Using the same notation as above, it is invoked as \texttt{c = matmul(a,b)}. The exact implementation of \texttt{matmul}, and whether or not it calls an external library, is compiler-dependent.
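
A minimal, self-contained sketch of its use on allocatable double-precision arrays is shown below; it is illustrative only, and the driver program in this project is more elaborate.
\begin{verbatim}
program matmul_demo
    implicit none
    integer, parameter :: n = 1000
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)

    allocate(a(n,n), b(n,n), c(n,n))
    call random_number(a)   ! fill operands with pseudo-random values
    call random_number(b)
    c = matmul(a, b)        ! intrinsic matrix-matrix product
    print *, 'c(1,1) =', c(1,1)
end program matmul_demo
\end{verbatim}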

\subsection{BLAS Routines}
A number of libraries provide high-performance implementations of common matrix operations. The Basic Linear Algebra Subprograms (BLAS) library is a collection of subroutines organized into three ``levels'': level 1 covers vector-vector routines, level 2 matrix-vector routines, and level 3 matrix-matrix routines. BLAS provides a level 3 routine for matrix-matrix multiplication known as \texttt{DGEMM} (Double-precision GEneral Matrix-matrix Multiplication), which can be used to find the product.
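
The routine computes $c \leftarrow \alpha ab + \beta c$. A sketch of the call for square $n\times n$ double-precision matrices is shown below; the arguments follow the reference BLAS interface, and the call site in this project's code may differ slightly.
\begin{verbatim}
! Sketch of a DGEMM call: 'N' means neither operand is transposed,
! and n serves as every dimension and leading dimension because the
! matrices are square. alpha = 1.0, beta = 0.0 gives c = a*b.
call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
\end{verbatim}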

For the binaries compiled with \texttt{gfortran}, OpenBLAS 0.3.31 was used. OpenBLAS is a modern implementation of BLAS that provides better support for newer computer hardware. For binaries compiled with the Intel OneAPI suite, Intel's MKL BLAS was used. Library usage was verified using \texttt{ldd} on Linux for each binary.

\subsection{Compiler and Flag Considerations}
Two different Fortran compilers were tested: \texttt{gfortran} from the GNU Compiler Collection, and \texttt{ifx} from Intel's OneAPI suite. Both compilers support various levels of optimization via the \texttt{O0}, \texttt{O1}, \texttt{O2}, \texttt{O3}, and \texttt{Ofast} flags. Each of these flags was tested to quantify its impact on wall time.

Additional flags were specified to maximize performance. By default, compilers tend to keep binaries `portable' by avoiding instruction sets or niche optimizations that may be unavailable on other hardware. Because the tests are only performed locally, both compilers were instructed to produce the highest-performance binary using every feature of the host CPU. In \texttt{gfortran} this was done with the \texttt{march=native} flag \cite{GCC2024}; in \texttt{ifx}, with the \texttt{xHost} flag \cite{Intel2025}.

\section{Comparisons and Results}
Runs were conducted parametrically and driven by a GNU Makefile. Timings were collected with the \texttt{system\_clock} intrinsic, which is available with both compilers. It is preferred over \texttt{cpu\_time} because it reports elapsed wall-clock time rather than CPU time accumulated across parallel workers. OpenMP also provides timing routines (e.g.\ \texttt{omp\_get\_wtime}) that may be more accurate; however, the code was compiled both with and without OpenMP, making them impractical to use.
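
A minimal sketch of this timing pattern is shown below; the variable names are illustrative and the project's timing code may differ in detail.
\begin{verbatim}
! Sketch: wall-clock timing with system_clock. Using integer(8)
! arguments selects the highest-resolution clock the compiler offers.
integer(8) :: t_start, t_end, rate
real(8)    :: wall_time

call system_clock(t_start, rate)   ! tick count and ticks per second
c = matmul(a, b)                   ! region being timed
call system_clock(t_end)
wall_time = real(t_end - t_start, 8) / real(rate, 8)
\end{verbatim}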

The runs presented in the following section are a subset of the total data collected. The full dataset is provided in Appendix \ref{apx:results}. All of the data was collected in serial (1 thread) or parallel (8 threads) on an 11th-generation Intel i5-11300H CPU running at 4.40 GHz. Runs were done overnight in a bare TTY session with a minimal number of background daemons running.
\subsection{Matrix Size}
As expected, an increase in matrix size corresponded to a non-linear increase in wall time. This was consistent across all compilers, flags, and techniques. An example dataset consisting of \texttt{gfortran} runs with \texttt{O3} optimization is presented in Figure \ref{fig:n-scaling}. No triple-loop runs were conducted for sizes larger than $N=3500$ because they became prohibitively slow.
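
This is expected from the $2n^3$ operation count given earlier: assuming a roughly constant sustained FLOP rate, doubling the matrix size should increase the wall time by about a factor of
\[
    \frac{t(2n)}{t(n)} \approx \frac{2(2n)^3}{2n^3} = 8.
\]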

\begin{figure}[H]
    \centering
    \def\svdwidth{5in}
    \hspace*{-1.3cm}
    \input{figures/f1_n_scaling.tex}
    \caption{Size vs. Wall Time Scaling with \texttt{gfortran} serial \texttt{O3}.}
    \label{fig:n-scaling}
\end{figure}

\subsection{Compilers}

The compiler used had a considerable impact on the wall time of the computation. Initial comparisons were done using only serial runs for both \texttt{gfortran} and \texttt{ifx}, which are shown below in Figure \ref{fig:comp-scaling-1}.

\begin{figure}[H]
    \centering
    \def\svdwidth{5in}
    \hspace*{-1.3cm}
    \input{figures/f2_compilers_scaling.tex}
    \caption{Short Size vs. Wall Time Scaling with serial \texttt{O3}.}
    \label{fig:comp-scaling-1}
\end{figure}

The triple-loop runs showed a gap of roughly two orders of magnitude between the GCC and OneAPI results, with OneAPI being faster. Interestingly, GCC was much closer to OneAPI's performance when using OpenBLAS, and faster when using \texttt{matmul}, as shown in Figure \ref{fig:comp-scaling-2}.

\begin{figure}[H]
    \centering
    \def\svdwidth{5in}
    \hspace*{-1.3cm}
    \input{figures/f3_compilers_scaling.tex}
    \caption{Long Size vs. Wall Time Scaling with serial \texttt{O3}.}
    \label{fig:comp-scaling-2}
\end{figure}

A comparison was also made to see whether the compilers provided similar support for parallelism using OpenMP. This was implemented by hand for the triple-loop techniques, but is enabled by default for OpenBLAS and MKL BLAS when the code is compiled with the \texttt{fopenmp} flag. Eight threads were used for all runs, which was verified using \texttt{htop}. The short runs are presented below in Figure \ref{fig:comp-scaling-3}.
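
One way to parallelize the triple loop with OpenMP is sketched below; the directive and loop order shown are illustrative and not necessarily identical to those used in this project.
\begin{verbatim}
! Sketch: distribute the outer j loop across threads. Each thread
! updates a distinct set of columns of c, so no atomics are needed.
c = 0.0d0
!$omp parallel do private(i, k) shared(a, b, c)
do j = 1, n
    do k = 1, n
        do i = 1, n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
    end do
end do
!$omp end parallel do
\end{verbatim}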

\begin{figure}[H]
    \centering
    \def\svdwidth{5in}
    \hspace*{-1.3cm}
    \input{figures/f4_compilers_scaling.tex}
    \caption{Short Size vs. Wall Time Scaling with parallel \texttt{O3}.}
    \label{fig:comp-scaling-3}
\end{figure}

The scaling is markedly non-linear in the log-log plot when compared with Figure \ref{fig:comp-scaling-1}. It is likely that the overhead of setting up 8 workers outweighs the useful computation for such small values of $N$. That said, there is still a clear trend of OneAPI offering superior performance for all presented $N$. The long runs show a different trend and are presented in Figure \ref{fig:comp-scaling-4}.

\begin{figure}[H]
    \centering
    \def\svdwidth{5in}
    \hspace*{-1.3cm}
    \input{figures/f5_compilers_scaling.tex}
    \caption{Long Size vs. Wall Time Scaling with parallel \texttt{O3}.}
    \label{fig:comp-scaling-4}
\end{figure}

When calling \texttt{matmul} and BLAS, GCC has higher performance until the matrices approach roughly $700\times700$ in size. Part of this can likely be attributed to statistical noise, as the time scale is on the order of milliseconds and the resolution of \texttt{system\_clock} may differ between the two compilers. GCC and OneAPI share similar performance for all sizes up to $12000\times12000$, with the exception of \texttt{ifx}'s \texttt{matmul}, which is consistently slower.

\subsection{Compiler Flags}

\subsection{Parallelism}

\section{Conclusions}

\bibliographystyle{ieeetr}
\bibliography{refs.bib}

\clearpage
\appendix

\section{Compiling and Testing the Code}
The git repository is hosted at \url{https://git.hhmoore.ca/mcsc-6030g/p1-matrix-product} and contains various files. All files related to the report, including exported figures, are kept within the \texttt{report} subdirectory. An explanation is provided for everything else:
\begin{description}
    \item \texttt{matrixproduct.f90} is the Fortran program that computes the matrix products. It does so in three major ways: via triple loop(s), using Fortran's intrinsic \texttt{matmul} function, and with OpenBLAS. Two variants of the triple-loop technique are provided, one looping in row-major order and the other in column-major order. The compiled binary expects command-line arguments specifying the start, stop, and step of the matrix sizes, followed by a \texttt{yes/no} string that toggles the triple-loop solutions on or off; these are turned off for ``long'' runs because they are prohibitively slow. A sketch of this argument handling is given after this list.

    \item \texttt{Makefile} is a GNU Make script that drives the compilation and testing process. Various make targets are provided in the file. To produce a full set of results, simply type \texttt{make tests}. The default target, run by calling \texttt{make} with no arguments, only compiles the binaries.

    \item \texttt{plots.gnu} is a script that generates the plots for the report using Gnuplot. The corresponding Makefile target can be run with \texttt{make plots}. This produces plots in hybrid \texttt{.pdf} and \texttt{.tex} formats that embed cleanly in the \LaTeX{} document.
\end{description}
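
As a purely illustrative sketch of the argument handling described above (the variable names are hypothetical and not taken from \texttt{matrixproduct.f90}), the intrinsic \texttt{get\_command\_argument} can be used as follows:
\begin{verbatim}
! Hypothetical sketch: read start/stop/step and the yes/no toggle.
character(len=32) :: arg
integer :: n_start, n_stop, n_step
logical :: do_loops

call get_command_argument(1, arg); read(arg, *) n_start
call get_command_argument(2, arg); read(arg, *) n_stop
call get_command_argument(3, arg); read(arg, *) n_step
call get_command_argument(4, arg)
do_loops = (trim(arg) == 'yes')
\end{verbatim}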

\section{Tabular Results}
\label{apx:results}

\end{document}