% report/Chapters/optimization.tex
% Chapter on optimization methods
\chapter{Optimization}
\label{ch:Optimization}

\section{Domain Decomposition}

Profiling the original simulation algorithm shows that a significant amount of time is spent on the ``spatial lookup'' for pair-wise interactions between particles. This bottleneck arises from the current implementation of particle interactions, in which the distance between a given particle and every other particle must be evaluated at every time step, with the LJ force applied only if the particles lie within the interaction radius $r_{cut} = \sigma 2^{1/6}$. Since the cutoff radius of the LJ potential is much smaller than the box, the majority of these calculations involve pairs of particles that cannot interact. Domain decomposition limits the number of pair-wise evaluations by avoiding comparisons between distant particles entirely.

The method divides the spatial domain of the simulation into a grid of square sectors, which is used to manage the number of interaction checks needed for each particle. If the side length of the sectors is at least the interaction radius, particles separated by more than one sector are guaranteed not to interact and can be ignored in the short-range interaction calculation; in practice, the sector length is rounded up from $r_{cut}$ to a value that evenly divides the box length. This limits the pair-wise evaluations for a given particle to the particles inhabiting the same sector or the eight surrounding sectors (the Moore neighbourhood of that particle's sector).
\subsection{Creating the Sector Grid}
For this implementation of domain decomposition, typewriter-style indexing is used, where the $(0,0)$ sector coordinate corresponds to the top-left corner of the simulation box. With the simulation box centred on the origin at $(0.0,0.0)$, we define two functions that convert the spatial coordinates $(x,y)$ to sector coordinates $(s_x,s_y)$,

\begin{equation}
    \begin{split}
        s_x(x) &= \left\lfloor \frac{x + L/2}{l_S} \right\rfloor\\
        s_y(y) &= S_y - 1 - \left\lfloor \frac{y + L/2}{l_S} \right\rfloor,
    \end{split}
    \label{pos-to-sectCoords_DD}
\end{equation}
where $S_y$ is the number of sectors per column; the offset of $1$ ensures that the top row of sectors has $s_y = 0$.

Additionally, each sector is assigned a unique ID so that all sector-based data can be stored in 1D arrays. To determine these unique IDs, we define a position-to-index function,

\begin{equation}
    S_{ID}(x,y) = s_x(x) + s_y(y)S_y,
    \label{pos-to-index_DD}
\end{equation}
which takes spatial coordinates $(x,y)$ as input and returns the sector ID using the sector coordinates calculated from \eqref{pos-to-sectCoords_DD}.

\begin{figure}
    \centering
    \includegraphics[width=0.5\linewidth]{Figures/sectorGridDD.png}
    \caption{Domain decomposition grid for a box of size $L=10$ centred on the origin, with sector size $l_S = 1.0$ for visualization purposes. The spatial coordinates $(x,y)$ are shown on the axes, while the sector coordinate and index are written in the bottom left and top left, respectively, of each sector.}
    \label{fig:sectorGrid}
\end{figure}

An example of the sector grid setup for domain decomposition is shown in Figure~\ref{fig:sectorGrid}, where the $L=10.0$ simulation box has been divided into $100$ sectors of length $l_S = 1.0$.

\subsection{Updating Spatial Lookup}
After the sector grid is initialized prior to the main simulation loop, the bulk of the work done by domain decomposition occurs in the middle of the Velocity Verlet update, between the position update and the acceleration update.
In Algorithm \ref{VVDD_algo}, the modified Velocity Verlet procedure is outlined with the domain decomposition functions, \code{updateParticleSector()}, \code{updateSpatialLookup()} and \code{updateNeighbourInteraction()}, added to the loop.

Following each particle's position update, the \code{updateParticleSector($x,y$)} function takes the particle's $x$ and $y$ positions as input and assigns the corresponding sector coordinates and index to that particle using \eqref{pos-to-sectCoords_DD} and \eqref{pos-to-index_DD}. Once all particles have completed their position and sector updates, the \code{updateSpatialLookup()} function sorts the particles by their sector index and returns the array \code{sectorStartIndices[$M$]}, whose $i^{th}$ entry is the index in the sorted particle list at which the particles located in sector $i$ begin. Finally, the \code{updateNeighbourInteraction()} function is called just before the acceleration update and evaluates the net force on each particle from its interactions with neighbours in the surrounding sectors.

\begin{algorithm}[h]
\caption{Velocity Verlet Main Loop with Domain Decomposition Functions}
\label{VVDD_algo}
\begin{algorithmic}
\Ensure Sector grid has been initialized with squares of size $l_S \geq r_{cut}$
\While{$t < t_{\max}$}
    \For{$i = 1$ to $N$}
        \State $v_{hx} \gets v_x + \frac{1}{2} a_x \, dt$ \Comment{1. half-step velocity}
        \State $x \gets x + v_{hx} \, dt$ \Comment{2. update position}
        \State applyBoundaryConditions($x,y$)
        \State updateParticleSector($x,y$) \Comment{update sector coords and index}
    \EndFor

    \State updateSpatialLookup(particles, sectors) \Comment{sort particles by sector index}
    \State updateNeighbourInteraction(particles, sectors) \Comment{find and evaluate neighbour interactions}
    \For{$i = 1$ to $N$}
        \State $a_x \gets \mathit{drag} \cdot v_{hx} + \mathit{noise} \cdot R(t) + F_x$ \Comment{3.
acceleration}
        \State $v_x \gets v_{hx} + \frac{1}{2} a_x \, dt$ \Comment{4. full-step velocity}
        \State Record data
    \EndFor
    \State Compute ensemble averages
    \State $t \gets t + dt$
\EndWhile
\end{algorithmic}
\end{algorithm}

\section{Parallelization with OpenMP}

\subsection{Parallel Neighbour Search}

Introducing parallelization distributes the workload of the neighbour search over multiple threads. Since the neighbour search in Algorithm 9 iterates over sectors independently, the work can be divided so that each thread handles a subset of sectors. Write operations for each sector search are therefore exclusive to the thread that handles them.

Race conditions are avoided because no two threads write data for the same sector, and because the data written for each sector is not read by other threads during the force-update loop; the remaining complications were resolved by introducing per-thread scratch variables.

\subsection{Parallelizing the Main Loop}

Testing showed that restricting parallelization to the neighbour search alone yielded suboptimal performance gains. Analysis found that the performance boost was being inhibited by two things. First, the parallel threads were being initialized once each time step, leading to redundant overhead. Second, disk I/O and random number generation are costly relative to simple FLOPs, and a lot of time was spent waiting for these tasks to conclude.

To mitigate the thread-creation overhead, the fork was moved outside the main simulation loop. Initializing the threads before the loop removes the cost of re-initializing them every step, but leaves most of the threads idle outside of the neighbour search. These idle threads can be put to use by giving them tasks that can be done in parallel: one thread writes to disk (after creating copies of the data to be written), one thread generates pseudo-random numbers, and one thread performs the position/sector updates.

\section{Writing to Disk}

Even with parallelization, disk I/O remains a significant bottleneck. In the baseline implementation, particle positions and velocities are written to the trajectory file at every time step, producing an extremely dense output that is largely redundant in the context of stochastic motion. To reduce this cost, a recording interval was introduced so that trajectory data is written only every $n_{rec}$ time steps. This drastically reduces the volume of I/O without any meaningful loss of physical information, since the motion between recorded frames is smooth and predictable relative to the timescale of interest.
