From 5c8bacf899769bf56c553abe6c990ba487dcee67 Mon Sep 17 00:00:00 2001
From: Connor Moore
Date: Fri, 13 Feb 2026 20:24:50 -0500
Subject: Added section on parallelization and fixed references.

---
 report/report.tex | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

(limited to 'report/report.tex')

diff --git a/report/report.tex b/report/report.tex
index 03eec31..003d01b 100644
--- a/report/report.tex
+++ b/report/report.tex
@@ -86,6 +86,7 @@ Runs were conducted parametrically and driven by a GNU Makefile. Results were ev
 The runs presented in the following section are a subset of the total data collected. The full dataset is provided in Appendix \ref{apx:results} of the document. All of the data was collected in serial (1 thread) or parallel (8 threads) on an 11th generation Intel i5-11300H CPU running at 4.40 GHz. Runs were done overnight on a bare TTY session with a minimal number of background daemons running.
 
 \subsection{Matrix Size}
+\label{sec:matrix-size}
 As expected, an increase in matrix size corresponded with a non-linear increase in wall time. Specifically, it tended toward $\mathcal{O}(n^3)$, which is the theoretical complexity of the product discussed in Section 2. This was consistent for all compilers, flags, and techniques. An example dataset consisting of \texttt{gfortran} runs with \texttt{O3} optimization is presented in Figure \ref{fig:n-scaling}. No runs were conducted using triple-loops for values larger than $N=3500$ as they became prohibitively slow.
 
 \begin{figure}[H]
@@ -176,7 +177,35 @@ There is a clear speedup seen by using \texttt{Ofast} compared to other flags up
 
 Unlike \texttt{gfortran}, the \texttt{ifx} compiler has a clear difference between \texttt{O1} and the remaining flags for all values of $N$ shown. The remaining flags are all similar in their performance, barring some small fluctuations. This implies that \texttt{O2} has the largest performance gain, and the others don't impact the wall time as much.
 
-The Intel OneAPI documentation has a page (helpfully titled ``O'') that discusses what the different optimization flags imply \cite{Intel2025}. The \texttt{O2} flag is specifically the lowest-level of optimization that supports auto vectorization of loops, which would explain the large discontinuity in wall time between the flags. The i5-11300H processor supports an extended instruction set including AVX-512, a 512-bit extension of the Advanced Vector Extensions (AVX) instruction set. This allows for SIMD processing of up to 8 \texttt{real64} floating-point numbers simultaneously. It is likely that the use of lanes explains a large part of the 10$\times$ decrease in wall time; other reasons could include loop unrolling or the use of a fused multiply-add (FMA) instruction, both of which are possible with \texttt{O2}-level optimization.
+The Intel OneAPI documentation has a page (helpfully titled ``O'') that discusses what the different optimization flags imply \cite{Intel2025}. The \texttt{O2} flag is specifically the lowest level of optimization that supports auto-vectorization of loops, which would explain the large discontinuity in wall time between the flags. The i5-11300H processor supports an extended instruction set including AVX-512, a 512-bit extension of the Advanced Vector Extensions (AVX) instruction set \cite{IntelWebsitei511300}. This allows for SIMD processing of up to 8 \texttt{real64} floating-point numbers simultaneously. It is likely that the use of these SIMD lanes explains a large part of the 10$\times$ decrease in wall time; other reasons could include loop unrolling or the use of a fused multiply-add (FMA) instruction, both of which are possible with \texttt{O2}-level optimization.
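+
+For reference, a minimal sketch of the kind of triple-loop kernel this discussion applies to is shown below. It is an illustration rather than the project source: the routine and variable names and the $j,k,i$ loop ordering are assumptions, chosen so that the innermost loop is unit-stride in Fortran's column-major layout and is therefore a natural target for the auto-vectorization and FMA behaviour described above.
+
+\begin{verbatim}
+! Illustrative kernel only -- names and loop ordering are assumptions,
+! not the project source. The innermost loop walks down a column of c
+! and a (unit stride), making it the natural auto-vectorization target
+! at the optimization levels discussed above: with AVX-512 this is
+! 8 real64 values per instruction, with FMA fusing the multiply and add.
+subroutine matmul_loops(a, b, c, n)
+    use iso_fortran_env, only: real64
+    implicit none
+    integer, intent(in) :: n
+    real(real64), intent(in)  :: a(n, n), b(n, n)
+    real(real64), intent(out) :: c(n, n)
+    integer :: i, j, k
+
+    c = 0.0_real64
+    do j = 1, n
+        do k = 1, n
+            do i = 1, n
+                c(i, j) = c(i, j) + a(i, k) * b(k, j)
+            end do
+        end do
+    end do
+end subroutine matmul_loops
+\end{verbatim}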
+
+\subsection{Parallelization}
+
+For every test run in serial, another ran in parallel. Presently, only a binary comparison is made, without variation in the number of threads. Results are presented below for the \texttt{gfortran} runs compiled with the \texttt{Ofast} flag. The general trend is consistent for the runs using a call to OpenBLAS, shown in Figure \ref{fig:parallel-blas}.
+
+\begin{figure}[H]
+ \centering
+ \def\svdwidth{5in}
+ \hspace*{0.1cm}
+ \input{figures/f8_parallel_speedup.tex}
+ \caption{Serial and parallel ($n=8$) \texttt{gfortran} performance when calling OpenBLAS. The parallel runs show a consistent decrease in run time, with serial taking anywhere from $\sim$30\% to $\sim$60\% more wall time depending on the size of the matrix. No plateau representing the overhead costs is observed.}
+ \label{fig:parallel-blas}
+\end{figure}
+
+As expected, there is a tangible decrease in wall time when comparing serial to parallel results. The green dashed line presents the percent difference between the parallel and serial runs; a value of 30\%, for example, means the serial run has a wall time 30\% larger than the parallel run. Worth noting is the distinct lack of a `plateau' at lower $N$ that could be expected from having to provision the parallel workers. Instead, the threaded runs had consistently lower wall times for all $N$ other than $N=100$. Also worth discussing is that the greatest difference achieved is roughly 60\% in the lower-$N$ portion, but only $\sim$30\% as $N$ increases to 3500. It is counterintuitive that, as $N$ grows into the low thousands, the `gains' from computing in parallel stagnate after decreasing by almost half.
+
+As mentioned earlier, the use of multiple threads was confirmed independently of the testing code via an external resource monitor, so a lack of active threads should not explain the discrepancy. Further, the times recorded were independent of any non-parallelizable activities, such as initializing random numbers or writing results to disk. Paradigms such as Amdahl's law show that the speedup of a program depends heavily on how much of it is readily `parallelizable' \cite{Hill_2008}, which could explain the poor performance had the \emph{entire} program been timed. It is likely that whatever solution technique is used by OpenBLAS simply does not scale perfectly with the number of threads employed. More variation was seen in the wall times for a triple-loop, shown below in Figure \ref{fig:parallel-loops}.
+
+\begin{figure}[H]
+ \centering
+ \def\svdwidth{5in}
+ \hspace*{0.1cm}
+ \input{figures/f9_parallel_speedup.tex}
+ \caption{Serial and parallel ($n=8$) \texttt{gfortran} performance when comparing triple-loops. Parallel shows a consistent advantage at larger $N$, with serial taking up to $\sim$150\% longer to run after $N=3300$. More scatter is seen at smaller $N$, likely a result of comparing already-small wall times.}
+ \label{fig:parallel-loops}
+\end{figure}
+
+Similar variations in the wall time difference are expected at lower $N$, as the wall times there are in the range of milliseconds. As $N$ exceeds 3000, however, a consistent difference of over 100\% is observed. This does not make the loop approach inherently `better' than OpenBLAS, but it does show that the loop approach seems to benefit more from parallelization. Indeed, the significant difference in wall times between the two in serial (Section \ref{sec:matrix-size}, Figure \ref{fig:n-scaling}) again points to a different solution technique being used in OpenBLAS, one that is not as receptive to parallelization. While these results are not a direct measure of wall time versus thread count, both plots showed improvements from applying parallel processing to matrix-matrix multiplication.
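+
+To make the comparison concrete, a minimal sketch of the two parallel paths is given below. It is an illustration under stated assumptions rather than the project source: the routine and variable names are hypothetical, and whether the loop variant actually uses an OpenMP \texttt{parallel do} (and over which loop) is an assumption, while the OpenBLAS variant is the standard \texttt{DGEMM} interface with threading handled inside the library.
+
+\begin{verbatim}
+! Illustrative only -- names and the use of OpenMP here are assumptions,
+! not the project source. OpenBLAS manages its own thread pool
+! (e.g. via the OPENBLAS_NUM_THREADS environment variable).
+subroutine matmul_omp(a, b, c, n)
+    use iso_fortran_env, only: real64
+    implicit none
+    integer, intent(in) :: n
+    real(real64), intent(in)  :: a(n, n), b(n, n)
+    real(real64), intent(out) :: c(n, n)
+    integer :: i, j, k
+
+    !$omp parallel do private(i, k)
+    do j = 1, n                    ! each thread owns whole columns of c
+        c(:, j) = 0.0_real64
+        do k = 1, n
+            do i = 1, n
+                c(i, j) = c(i, j) + a(i, k) * b(k, j)
+            end do
+        end do
+    end do
+    !$omp end parallel do
+end subroutine matmul_omp
+
+! OpenBLAS variant: a single call, threading handled inside the library.
+! call dgemm('N', 'N', n, n, n, 1.0_real64, a, n, b, n, 0.0_real64, c, n)
+\end{verbatim}
+
+Under these assumptions, the loop variant would be compiled with something like \texttt{gfortran -Ofast -fopenmp} and the OpenBLAS variant linked with \texttt{-lopenblas}; the two therefore distribute work across the eight threads in quite different ways, which is consistent with the different scaling observed above.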
 
 \bibliographystyle{ieeetr}
 \bibliography{refs.bib}
@@ -196,7 +225,6 @@ The git repository is hosted at \url{https://git.hhmoore.ca/mcsc-6030g/p1-matrix
 \item \texttt{plots.gnu} is a script that generates plots for the report using Gnuplot. The makefile target for plots can be run with \texttt{make plots}. This will produce plots in hybrid \texttt{.pdf} and \texttt{.tex} formats that embed cleanly in the \LaTeX{} document.
 \end{description}
 
-\clearpage
 \section{Tabular Results}
 \label{apx:results}
 \begin{longtable}{llcccccc}
-- 
cgit v1.2.3