Diffstat (limited to 'report/Chapters')
 report/Chapters/algorithms.tex   |  99 +
 report/Chapters/introduction.tex |  39 +
 report/Chapters/optimization.tex | 115 +
 report/Chapters/results.tex      |  80 +
 report/Chapters/simulation.tex   | 168 +
5 files changed, 501 insertions, 0 deletions
diff --git a/report/Chapters/algorithms.tex b/report/Chapters/algorithms.tex new file mode 100644 index 0000000..749a0b5 --- /dev/null +++ b/report/Chapters/algorithms.tex @@ -0,0 +1,99 @@ +\chapter{Algorithms} +\label{AppAlgos} +\section{Domain Decomposition Algorithms} + +\subsubsection{Updating Particle Sector} + +The \code{updateParticleSector(x,y)} function outlined in Algorithm \ref{updateParticleSector_code} is called for each particle after its position is updated in the movement loop. Taking the particle's position coordinates as input, the function calculates the sector coordinates (\code{sectX, sectY}) for that position and the corresponding sector index using equations \eqref{pos-to-sectCoords_DD} and \eqref{pos-to-index_DD} respectively. A validation check is also included in the function, where particles at invalid positions are flagged by assigning them the sector index $-1$. As particles are assigned to their respective sectors, we also keep track of sector populations, which are used in the subsequent \code{updateSpatialLookup()} function for sorting the particles. + + +\subsubsection{Particle Sorting and Spatial Lookup} +Algorithm \ref{updateSpatialLookup_code} shows the procedure for preparing the optimized spatial lookup. First, the populations of each sector are used to determine the size of the block needed for a given sector. Following the first sector, which always starts at index 0, the start index of each subsequent sector block is calculated as the start index of the previous block plus its population. This first loop yields the \code{sectorStartIndices[M]} array, where the $i^{th}$ entry is the index in the sorted particle list where the block for sector $i$ begins. Using the start indices array, particles can then be directly inserted into the sorted list based on their respective sector indices.
As particles are placed into the list, the population of inserted sector members is tracked so that no two particles are placed at the same index. + +While typical sorting methods like Quicksort and Merge Sort, with complexity $\O{N\log N}$, would offer a performance boost over the $\O{N^2}$ scaling of the original implementation, using the a priori knowledge of the sector populations allows us to produce the sorted particle list in only $\O{N}$ time. Thus, a significant boost to performance is expected from the implementation of this method. + +\subsubsection{Neighbourhood Search} +The final step of domain decomposition in the simulation loop is to make use of the sorted list and start indices produced from the spatial lookup. In Algorithm \ref{updateNeighSearch_code}, the procedure for finding interacting neighbours is outlined. The outermost loop iterates over each sector $s$, first determining the start and end indices of $s$. If the start and end indices for a given sector are the same, it has a population of $0$ and is skipped by the empty-sector check. If the sector is populated, the first inner loop is entered, which iterates over the particles of sector $s$. For each of these particles $p1$, the next For-loop goes through each of the (up to) nine surrounding sectors and identifies neighbouring particles $p2$ in those sectors. The separation distance between $p1$ and each $p2$ is then evaluated and the corresponding LJ force is applied to $p1$. Following the loop, each particle has its net deterministic force $F$, which is then used in the Velocity Verlet acceleration update.
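The counting-sort construction described above can be sketched outside the report's Fortran code as follows (an illustrative Python sketch; function and variable names are hypothetical, not those of the actual implementation):

```python
def update_spatial_lookup(sector_indices, num_sectors):
    """Build the sorted particle list in O(N) using known sector populations."""
    n = len(sector_indices)
    # First pass: count the population of each sector (index -1 flags invalid particles)
    sector_pops = [0] * num_sectors
    for s in sector_indices:
        if s >= 0:
            sector_pops[s] += 1
    # Prefix sum: sector i's block starts where block i-1 ends
    sector_start = [0] * num_sectors
    for i in range(1, num_sectors):
        sector_start[i] = sector_start[i - 1] + sector_pops[i - 1]
    # Second pass: insert each particle directly at its slot; track how many
    # members of each sector are already placed so no two share an index
    inserted = [0] * num_sectors
    sorted_ids = [None] * n
    for p, s in enumerate(sector_indices):
        if s < 0:
            continue
        sorted_ids[sector_start[s] + inserted[s]] = p
        inserted[s] += 1
    return sorted_ids, sector_start
```

For example, particles with sector indices `[2, 0, 1, 0, 2]` on a three-sector grid sort to `[1, 3, 2, 0, 4]` with block starts `[0, 2, 3]`.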
+ +\begin{algorithm}[h] +\caption{updateParticleSector Function} +\label{updateParticleSector_code} +\begin{algorithmic} + \State sectorPops[M] = [0,0,...,0] \Comment{Array to store sector populations} + \Function{updateParticleSector}{$x$,$y$} + \State sectX $\gets$ floor$((x + L/2)/(l_S))$ + \State sectY $\gets S_y - 1 -$ floor$((y + L/2)/(l_S))$ + \If{sectX $< 0$ OR sectX $\geq S_x$ OR sectY $< 0$ OR sectY $\geq S_y$} \Comment{Set out-of-bounds particles to null sector} + \State sectorIndex $\gets -1$ + \Else + \State sectorIndex $\gets sectX + sectY*S_y$ \Comment{Assign sector index to particle} + \State sectorPops[sectorIndex] $+= 1$ \Comment{Add one to sector population} + \EndIf + \EndFunction +\end{algorithmic} +\end{algorithm} + + +\begin{algorithm}[h] +\caption{updateSpatialLookup Function} +\label{updateSpatialLookup_code} +\begin{algorithmic} + \State particles[N] \Comment{Incoming list of particles} + \State sortedParticles[N] \Comment{An empty array to be filled with sorted particles} + \State sectorPops[M] \Comment{Array of sector populations} + \State sectorStartIndices[M] = [-1, -1, ...
, -1] \Comment{Start index array initialized to -1} + \Function{updateSpatialLookup}{particles[N]} + \State Determine where each sector will start in the particle list: + \For{$i = 0$ to $M-1$} \Comment{Loop over each sector} + \If{$i==0$} \Comment{Special case for first sector: always starts at index 0} + \State sectorStartIndices[i] $\gets 0$ + \Else + \State sectorStartIndices[i] $\gets$ sectorStartIndices[i-1] + sectorPops[i-1] + \State sectorPops[i-1] $\gets 0$ \Comment{Set sector pop to zero to recount during insertion step} + \EndIf + \EndFor + + \State Insert particles into sorted list: + \For{$p=1$ to $N$} + \State $i \gets$ particles[p].sectorIndex + \State insertIndex $\gets$ sectorStartIndices[i] + sectorPops[i] + \State sortedParticles[insertIndex] $\gets$ particles[p] \Comment{Place particle in its sector block} + \State sectorPops[i] $+= 1$ + \EndFor + + \State particles[N] $\gets$ sortedParticles[N] \Comment{Copy sorted list into the original list} + \EndFunction +\end{algorithmic} +\end{algorithm} + + + +\begin{algorithm}[h] +\caption{updateNeighbourInteraction Function} +\label{updateNeighSearch_code} +\begin{algorithmic} + \State + \Function{updateNeighbourInteraction}{particles[N], sectors[M]} + \For{$s = 0$ to $M-1$} \Comment{Loop over sectors} + \State sectorStart $\gets$ sectorStartIndices[s] + \State sectorEnd $\gets$ sectorStart + sectorPops[s] + \If{sectorStart $==$ sectorEnd} + \State \textbf{continue} \Comment{Skip empty sectors} + \EndIf + \For{$p1 =$ sectorStart to sectorEnd$-1$} \Comment{Loop over particles in sector $s$} + \For{each sector $n$ in the Moore neighbourhood of $s$} \Comment{Loop over the (up to) nine neighbouring sectors} + \State searchStart $\gets$ sectorStartIndices[n] + \State searchEnd $\gets$ searchStart + sectorPops[n] + \For{$p2 =$ searchStart to searchEnd$-1$} \Comment{Loop over particles in neighbouring sector $n$} + \If{$p1 == p2$} + \State \textbf{continue} \Comment{Skip self-interaction} + \Else + \If{$||p1 - p2|| \leq r_{cut}$} \Comment{If separation distance is within cutoff} + \State $F_{p1} +=$ applyLJ($p1,p2$) \Comment{Add LJ force to $p1$} + \EndIf + \EndIf + \EndFor + \EndFor + \EndFor + \EndFor + + \EndFunction +\end{algorithmic} +\end{algorithm}
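The loop over the surrounding sectors needs the IDs of the sectors in the Moore neighbourhood. A minimal sketch of that enumeration (illustrative Python using the $s_x + s_y S_y$ ID convention of the sector grid; names are hypothetical, not the Fortran implementation):

```python
def moore_neighbourhood(sect_x, sect_y, num_x, num_y):
    """Return the IDs of the (up to) nine sectors around (sect_x, sect_y)."""
    ids = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = sect_x + dx, sect_y + dy
            # Sectors on the box edge have fewer than nine neighbours
            if 0 <= nx < num_x and 0 <= ny < num_y:
                ids.append(nx + ny * num_y)
    return ids
```

An interior sector has nine neighbours (itself included); a corner sector has only four.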
\ No newline at end of file diff --git a/report/Chapters/introduction.tex b/report/Chapters/introduction.tex new file mode 100644 index 0000000..5bb10b3 --- /dev/null +++ b/report/Chapters/introduction.tex @@ -0,0 +1,39 @@ +\chapter{Introduction} +\section{Basic Physics} +This project presents a Langevin dynamics simulation of $N$ particles undergoing Brownian motion, confined in a two-dimensional square box of side length $L = 1.0$ with reflective boundary conditions. The primary objective is to study particle motion through the root-mean-square displacement (RMSD), which allows identification of the ballistic ($\mathrm{RMSD} \sim t$) and diffusive ($\sim t^{1/2}$) regimes. A secondary objective is computational: the most expensive component of the simulation is the evaluation of pairwise particle interactions arising from the WCA potential, which scales as $\mathcal{O}(N^2)$ per timestep. To address this, the program is structured using a domain decomposition algorithm to reduce computational cost and enable efficient parallelization with OpenMP. + +Brownian motion describes particles suspended in a fluid medium and subject to random thermal forces arising from microscopic collisions \cite{feynman41}. The equation of motion for a single particle is given by the Langevin equation: + +\begin{equation} + m\ddot{\mathbf{x}} = -\nabla U(\mathbf{x}) - \gamma \dot{\mathbf{x}} + + \sqrt{2m\gamma k_BT}\, \mathbf{R}(t) +\end{equation} + +where $-\nabla U(\mathbf{x})$ is the conservative WCA force, $-\gamma\dot{\mathbf{x}}$ represents viscous drag, and $\mathbf{R}(t)$ is a stochastic noise term whose strength scales with temperature. At short timescales, particle motion is ballistic, retaining memory of the initial velocity, while at longer timescales the motion becomes diffusive due to the randomizing effect of the noise.
The simulation is implemented using the velocity-Verlet integration scheme, with random forces generated using the Box--Muller transform and initial velocities sampled from the Maxwell--Boltzmann distribution. + +Pairwise particle interactions are modelled using the Weeks--Chandler--Andersen (WCA) potential, the purely repulsive truncation of the Lennard-Jones potential. The force between particles $i$ and $j$ is given by +\begin{equation} + \mathbf{F}_{ij} = -\nabla V(r_{ij}), +\end{equation} + +where the Lennard-Jones potential is + +\begin{equation} + V(r) = 4\varepsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - + \left(\frac{\sigma}{r}\right)^6 \right]. +\end{equation} + +The interaction is truncated at the cutoff distance $r_c = 2^{1/6}\sigma$, corresponding to the minimum of the potential; for $r > r_c$ both the potential and force are set to zero. + +%\section{Boundary Conditions} +%Various boundary conditions (BCs) exist when considering particle transport problems. The most common types include vacuum BCs, where particles are considered finished once they leave the domain, reflective BCs, where particles bounce off the domain boundary, and +%periodic BCs, where a particle leaving one side re-enters from the opposite side \cite{demaziere2020}. This simulation employs specular reflective BCs on all sides of the box, illustrated in Figure \ref{fig:reflective-BC}. + +%\begin{figure}[H] +% \centering +% \input{Figures/reflectiveBC.pdf_tex} +% \caption{Working principle of a specular reflective boundary condition. Red: A particle +% penetrates the boundary after a discrete time step. Magenta: The particle position in +% the $L$-direction is adjusted as if it had `bounced' off of the boundary during the +% time step. Blue: The particle continues being modeled after the correction in the next +% time step.} +% \label{fig:reflective-BC} +%\end{figure}
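The truncated pair force can be sketched as follows (an illustrative Python sketch, not the simulation code; $\varepsilon = \sigma = 1$ are placeholder values):

```python
EPS = 1.0                      # placeholder well depth
SIGMA = 1.0                    # placeholder particle diameter
R_CUT = 2 ** (1 / 6) * SIGMA   # cutoff at the minimum of the LJ potential

def wca_force(dx, dy):
    """Force on particle i from particle j, with (dx, dy) = x_i - x_j."""
    r2 = dx * dx + dy * dy
    if r2 >= R_CUT * R_CUT:
        return (0.0, 0.0)               # beyond the cutoff: potential and force are zero
    s6 = (SIGMA * SIGMA / r2) ** 3      # (sigma/r)^6
    # F = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6)/r^2 * (dx, dy), purely repulsive for r < r_c
    f_over_r2 = 24 * EPS * (2 * s6 * s6 - s6) / r2
    return (f_over_r2 * dx, f_over_r2 * dy)
```

At $r = \sigma$ the force is repulsive along the separation vector; beyond $r_c$ it vanishes identically, as in the WCA truncation.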
\ No newline at end of file diff --git a/report/Chapters/optimization.tex b/report/Chapters/optimization.tex new file mode 100644 index 0000000..c8ae81b --- /dev/null +++ b/report/Chapters/optimization.tex @@ -0,0 +1,115 @@ +% Chapter on optimization methods +\chapter{Optimization} +\label{ch:Optimization} + +\section{Domain Decomposition} + +In evaluating the performance of the original simulation algorithm, a significant amount of time is spent on the `spatial lookup' for pair-wise interactions between particles. This bottleneck emerges from the current implementation of particle interactions, where the distance between a given particle and every other particle must be evaluated in every time step, with the LJ force only applied if the particles are within the interaction radius $r_{cut} = \sigma 2^{1/6}$. Given that the cutoff radius for the LJ potential is significantly smaller than the size of the box, the majority of these calculations are between particles that will not interact with each other. Domain decomposition serves to limit the number of these pair-wise evaluations by avoiding comparisons between distant particles entirely. + +This method works by dividing the spatial domain of the simulation into a grid of square sectors, which is used to manage the number of interaction checks needed for each particle. The sector side length is set to at least the interaction radius, rounded up to a value which evenly divides the box length; particles separated by more than one sector are then guaranteed not to interact and can be ignored in the short-range interaction calculation. This limits the pair-wise evaluations for a given particle to only those particles which inhabit the same sector or the eight surrounding sectors (the Moore neighbourhood of that particle's sector).
+ +\subsection{Creating the Sector Grid} +For this implementation of domain decomposition, typewriter-style indexing is utilized, where the $(0,0)$ sector coordinate corresponds to the top-left corner of the simulation box. With the simulation box centred about the origin at $(0.0,0.0)$, we can create two functions which convert the spatial coordinates $(x,y)$ to sector coordinates $(s_x,s_y)$, + +\begin{equation} + \begin{split} + s_x(x) &= \left\lfloor \frac{x + L/2}{l_S} \right\rfloor\\ + s_y(y) &= S_y - 1 - \left\lfloor \frac{y + L/2}{l_S} \right\rfloor, + \end{split} + \label{pos-to-sectCoords_DD} +\end{equation} +so that $s_y = 0$ corresponds to the top row of sectors. Additionally, each sector is also assigned a unique ID to allow all sector-based data to be stored in 1D arrays. To determine these unique IDs, we define a position-to-index function, + +\begin{equation} + S_{ID}(x,y) = s_x(x) + s_y(y)S_y, + \label{pos-to-index_DD} +\end{equation} +which takes spatial coordinates $(x,y)$ as input and returns the sector ID using the sector coordinates calculated from \eqref{pos-to-sectCoords_DD}. + +\begin{figure} + \centering + \includegraphics[width=0.5\linewidth]{Figures/sectorGridDD.png} + \caption{Domain decomposition grid for a box size of $L=10$ centered around the origin with sector size $l_S = 1.0$ for visualization purposes. The spatial coordinates $(x,y)$ are shown on the axes, while the sector coordinates and index are written in the bottom-left and top-left corners, respectively, of each sector.} + \label{fig:sectorGrid} +\end{figure} + +An example of the sector grid setup for domain decomposition is shown in Figure \ref{fig:sectorGrid}, where the $L=10.0$ simulation box has been divided into $100$ sectors with length $l_S = 1.0$. + +\subsection{Updating Spatial Lookup} +After initializing the sector grid prior to the main simulation loop, the bulk of the work done by domain decomposition occurs in the middle of the Velocity Verlet process, between the position update and acceleration updates.
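As a quick sanity check of the sector mapping, the coordinate and ID functions can be sketched for the $L = 10$, $l_S = 1$ example grid (illustrative Python, assuming zero-based sector coordinates with $(0,0)$ in the top-left corner; not the Fortran implementation):

```python
import math

L = 10.0                 # box side length, as in the example grid
LS = 1.0                 # sector side length
SX = SY = int(L / LS)    # sectors per row / column

def sector_coords(x, y):
    """Map spatial coordinates to typewriter-style sector coordinates."""
    sx = math.floor((x + L / 2) / LS)
    sy = (SY - 1) - math.floor((y + L / 2) / LS)   # row 0 at the top of the box
    return sx, sy

def sector_id(x, y):
    sx, sy = sector_coords(x, y)
    return sx + sy * SY
```

The top-left sector then carries ID 0 and the bottom-right sector ID 99, matching a $10 \times 10$ grid.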
In Algorithm \ref{VVDD_algo}, the modified Velocity Verlet program is outlined with the domain decomposition functions, \code{updateParticleSector()}, \code{updateSpatialLookup()} and \code{updateNeighbourInteraction()}, added to the procedure. + +Following each particle's position update, the \code{updateParticleSector($x,y$)} function takes the particle's $x$ and $y$ positions as input and assigns the corresponding sector coordinates and index to that particle using \eqref{pos-to-sectCoords_DD} and \eqref{pos-to-index_DD}. Once all particles have completed their position/sector updates, the \code{updateSpatialLookup()} function then sorts the particles by their sector index and returns the array \code{sectorStartIndices[$M$]} where the value stored at the $i^{th}$ entry is the index in the sorted particle list where the particles located in sector $i$ begin. Finally, the \code{updateNeighbourInteraction()} function is called just before the acceleration update and evaluates the net force from interaction between each particle and its neighbours in the surrounding sectors. + +\begin{algorithm}[h] +\caption{Velocity Verlet Main Loop with Domain Decomposition Functions} +\label{VVDD_algo} +\begin{algorithmic} +\Ensure Sector grid has been initialized with squares of size $l_S \geq r_{cut}$ +\While{$t < t_{\max}$} + \For{$i = 1$ to $N$} + \State $v_{hx} \gets v_x + \frac{1}{2} a_x \, dt$ \Comment{1. half-step velocity} + \State $x \gets x + v_{hx} \, dt$ \Comment{2. update position} + \State applyBoundaryConditions($x,y$) + \State updateParticleSector($x,y$) \Comment{update sector coords and index} + \EndFor + + \State updateSpatialLookup(particles, sectors) \Comment{Sort particles by cell key} + \State updateNeighbourInteraction(particles, sectors) \Comment{Search for and evaluate interaction with neighbours} + \For{$i =1$ to $N$} + \State $a_x \gets drag*v_{hx} + noise*R(t) + F_x$ \Comment{3. 
acceleration} + \State $v_x \gets v_{hx} + \frac{1}{2} a_x \, dt$ \Comment{4. full-step velocity} + \State Record data + \EndFor + \State Compute ensemble averages + \State $t \gets t + dt$ +\EndWhile +\end{algorithmic} +\end{algorithm} + + + +\section{Parallelization with OpenMP} + +\subsection{Parallel Neighbour Search} +%- the primary purpose of introducing parallelization is to distribute the work load of the neighbour search over multiple threads + +Introducing parallelization distributes the workload of the neighbour search over multiple threads. Since the neighbour search in Algorithm \ref{updateNeighSearch_code} iterates over sectors independently, the work can be divided so that each thread handles a subset of sectors. \code{Write} operations for each sector search are therefore exclusive to the thread that handles them. + +Race conditions are avoided by the fact that no two threads write data for the same sector, and that the data being written for each sector is not read by other threads during the force update loop; remaining complications were avoided by introducing scratch variables. + +%- we parallelize the search over sectors so that the \code{write} operations for each sector are exclusive to the thread that handles them + +%- race conditions are avoided by the fact that no two threads write data for the same sector, and that the data being written for each sector is not read by other threads during the force update loop. + +\subsection{Parallelizing the Main Loop} +%- after introducing parallelization for the neighbour search, the performance gains we found were suboptimal to say the least. + +%- we found that the performance boost was being inhibited by two things: +% 1. The parallel threads were being initialized once each time step, leading to redundant overhead +% 2. 
The write-to-disk and random number generation are costly functions in comparison to simple FLOPs, and a lot of time is spent waiting on these + +%- to mitigate the overhead of parallelizing threads, the fork was moved to outside the main simulation loop. Initializing the threads before the loop cut down on the total overhead from reinitializing them every step, but leaves most of the threads idle when outside of the neighbour search. + +%- we can make use of some of these idle threads by giving them tasks that can be done in parallel: +% 1. One thread is used to write to disk (must create copies of the data to be written) +% 2. One thread is used to generate pseudo-random numbers +% 3. One thread is used to do the position/sector updates + +Testing showed that limiting parallelization to just the neighbour search yielded suboptimal performance gains. Analysis found that the performance boost was being inhibited by two things. First, the parallel threads were being re-initialized every time step, leading to redundant overhead. Second, disk I/O and random number generation are costly relative to simple FLOPs, and much of the time is spent waiting for these tasks to complete. + +To mitigate the thread-creation overhead, the fork was moved outside the main simulation loop. Initializing the threads before the loop cut down on the total overhead from re-initializing them every step, but leaves most of the threads idle outside of the neighbour search. These idle threads can be utilized by giving them tasks that run in parallel with the main loop: one thread writes to disk (working on copies of the data to be written), one generates pseudo-random numbers, and one performs the position/sector updates. + +\section{Writing to Disk} +%- even with parallelization, writing to disk proves to be a troublesome bottleneck. +%- Until now, the particle data has been written to the trajectory file at every time step. 
This results in extremely dense trajectory data which is mostly redundant in the context of stochastic motion. +%- By introducing a recording interval where data is only written every $1000$ steps, for example, we can drastically reduce the time spent writing without losing any actual information. + +Even with parallelization, disk I/O remains a significant bottleneck. In the baseline implementation, particle positions and velocities are written to the trajectory file at every time step, producing an extremely dense output that is largely redundant in the context of stochastic motion. To reduce this cost, a recording interval was introduced so that trajectory data is only written every $n_{rec}$ time steps. This drastically reduces the volume of I/O without any meaningful loss of physical information, since the motion between recorded frames is smooth and predictable relative to the timescale of interest. + + + + + + + diff --git a/report/Chapters/results.tex b/report/Chapters/results.tex new file mode 100644 index 0000000..82077ba --- /dev/null +++ b/report/Chapters/results.tex @@ -0,0 +1,80 @@ +\chapter{Results} + +\section{Simulation Performance} + +\subsection{Domain Decomposition} +To evaluate the performance gained from domain decomposition, simulations with different numbers of particles were run with increasing numbers of sectors, each simulation consisting of $10^5$ time steps. The $N^2$ scaling of the original algorithm came from the need to evaluate distances between each of the $N$ particles and the other $N-1$ particles. With domain decomposition, each of the $N$ particles only needs to be evaluated against particles in the $9$ surrounding sectors. This means that on average, the number of interaction calculations scales as $\propto N \left( 9 \frac{N}{S}\right)$. The wall time should then grow similarly, plus the time taken to iterate over the sectors, which scales as $S$. 
The expected scaling of wall time for domain decomposition is then +\begin{equation} + W \propto N\left(9\frac{N}{S}\right) + S. +\end{equation} + +\begin{figure}[h] + \centering + %\includegraphics[width=0.5\linewidth]{Figures/time_vs_N.png} + \input{Figures/fig41.tex} + \caption{Wall times versus the number of particles for different numbers of sectors.} + \label{fig:time_vs_N} +\end{figure} + + +\subsubsection{Optimal Number of Sectors} +The previous result showed that the number of sectors $S$ that minimizes the wall time changes with the number of particles $N$, suggesting that there is an optimal number of sectors for a given number of particles based on the density of particles per sector, $\rho$. Here, we investigate this relationship and seek to identify the optimal choice of $S$ for a given $N$. + +The existence of an optimal number of sectors is not surprising. When the number of sectors is small, and thus their size larger, particles that occupy the same sector can still often be outside of the interaction range, resulting in many wasted distance calculations. Adding sectors at this stage reduces the search radius for interactions, leading to fewer unnecessary distance calculations and a shorter wall time. Increasing the number of sectors further will eventually result in approximately one particle per sector on average. From this point, adding more sectors \textit{increases} the number of wasted evaluations, this time through the need to iterate over empty sectors. Based on this, we can expect that the optimal number of sectors is that which yields a density of one particle per sector. + +To measure the dependence of wall time on the sector density, simulations were run for different $N$, ranging from a single particle to $N=1024$, over increasing numbers of sectors. In Figure \ref{fig:time_vs_density}, the measured wall times are plotted as a function of the number of particles per sector. 
For any number of particles, it can be seen that the shortest wall time is achieved when the sector density is $\rho_s = 1.0$ and that the wall time increases with any deviation from this density, as expected. Thus, the optimal number of sectors for minimizing wall time is $S=N$. + +\begin{figure}[H] + \centering + %\includegraphics[width=0.6\linewidth]{Figures/times_vs_density.png} + \input{Figures/fig42.tex} + \caption{Wall times versus the density of particles per sector for different $N$.} + \label{fig:time_vs_density} +\end{figure} + + +\subsection{Parallelization} + +\begin{figure}[H] + \centering + %\includegraphics[width=0.9\linewidth]{Figures/parallelization_results.pdf} + \input{Figures/fig43.tex} + \caption{Wall times for different numbers of threads and different $N$ values.} + \label{fig:parallelization_results} +\end{figure} + +At lower particle counts ($N = 1024$), thread management overhead dominates and parallelization provides no meaningful benefit. At larger counts ($N = 2048$ and $N = 4096$), speedup is observed going from 2 to 4 threads, but gains diminish beyond that. The maximum observed speedup is approximately $2.6\times$ at $N = 4096$ with 8 threads, considerably below the ideal $4\times$ speedup. + + +\section{Langevin Dynamics} + +\subsection{Root Mean Square Displacement} +The RMSD scales as $t$ in the ballistic regime and as $t^{1/2}$ in the diffusive regime, and should eventually saturate as the system equilibrates (particles begin to reach the boundaries of the box). The RMSD is calculated by summing the squared Euclidean distance between each relevant particle and its origin (birth position). In the case of the developed Fortran program, a ``relevant'' particle is one which has not escaped the box. The particles are essentially always in motion, and therefore the RMSD will vary with the simulation time. 
This is expressed in mathematical form as: +\begin{equation} + \mathrm{RMSD}(t) = \sqrt{ + \frac{1}{N_{\mathrm{in}}(t)} + \sum_{i\,\in\,\mathrm{inside}} + \left|\mathbf{x}_i(t) - \mathbf{x}_i(0)\right|^2 + }. + \label{eq:rmsd} +\end{equation} +Various runs were conducted to investigate the behaviour of the RMSD for the system. The transition time $\tau$ that separates the ballistic and diffusive regimes can be expressed analytically as: +\begin{equation} + \tau_p = \frac{m}{\gamma} +\end{equation} +where $m$ is the mass of the particle and $\gamma$ is the drag coefficient (also known as the Stokes friction coefficient) \cite{ma2012}. Worth noting is that the boundary does not represent a `hard' transition from a perfectly $t$-proportional regime to a $t^{1/2}$-proportional one; rather, the curve tends towards the correct profile as $\left| t-\tau_p \right| \rightarrow \infty$. The behaviour is shown in Figure \ref{fig:rmsd} for various values of the friction coefficient $\gamma$: + +\begin{figure}[H] + \centering + \hspace{-1cm} + \input{Figures/rmds.tex} + \caption{Smoothed RMSD as a function of simulation time and the drag coefficient $\gamma$. Note the relevant transition boundary $\tau_p$ plotted for each curve; a sample $f(t)\propto t$ is provided (dashed black line). The deviation from $t$-proportionality is clearly seen as the transition point is taken closer and closer to the starting time of $t=0$.} + \label{fig:rmsd} +\end{figure} + +Note that in the figure above, a `smoothed' view is provided for the RMSD, shown in solid lines. The raw output from the code contains some fluctuations that make visualization of the trends difficult. Smoothing is applied strictly during post-processing using \texttt{cspline} in Gnuplot. For clarity, the data points are also presented for each curve. The point density considerably increases as time goes on because data is sampled at a regular interval, so the increase in `outlier' points is expected. 
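The RMSD definition in \eqref{eq:rmsd} can be sketched in a few lines (illustrative Python mirroring the `relevant particle' bookkeeping, not the actual Fortran routine):

```python
import math

def rmsd(positions, origins, inside):
    """RMSD over the particles still inside the box."""
    total, count = 0.0, 0
    for (x, y), (x0, y0), ok in zip(positions, origins, inside):
        if ok:  # only particles that have not escaped contribute
            total += (x - x0) ** 2 + (y - y0) ** 2
            count += 1
    return math.sqrt(total / count) if count > 0 else 0.0
```

For instance, one particle displaced by $(3,4)$ from its origin and one escaped particle give $\mathrm{RMSD} = 5$.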
+ +Despite the fluctuations, a clear trend is still visible with respect to the analytical values of $\tau_p$. The coefficient $\gamma=0.5$ yields a curve that is mostly linear until the regime boundary, at which point there is a clear deviation. The systems that transition sooner (e.g., $\gamma = 1000$) deviate from the linear trend immediately. + +\subsection{System Energy} +The total system energy was investigated as part of the testing methodology for boundary conditions. It was demonstrated that significant fluctuations exist only if stochastic and drag forces are considered. If a billiards-like system is considered, where particles only bounce and interact with one another, the system's kinetic energy remains constant except when particle-particle interactions occur. When particles interact, kinetic energy is transferred to the Lennard-Jones potential between both particles, and as such there are localized valleys in the system kinetic energy over time.
\ No newline at end of file diff --git a/report/Chapters/simulation.tex b/report/Chapters/simulation.tex new file mode 100644 index 0000000..ce530fc --- /dev/null +++ b/report/Chapters/simulation.tex @@ -0,0 +1,168 @@ +% "Methods" section - describe basic structure of LD algo without optimizations +\chapter{Simulation} + +\section{Simulation Setup} + +The physics were first developed as pseudo-code before implementation in FORTRAN. Particle mass is normalized to $m = 1$ throughout, initial velocities are sampled using the Box--Muller transform to produce Gaussian-distributed random numbers, and positions and velocities are integrated using the velocity-Verlet scheme. + +\subsection{Particle Initialization} +Before a simulation is conducted, particles must be prescribed initial positions and velocities. Positions can be chosen arbitrarily, with the exception that particles should not overlap. If two particles are placed sufficiently close, a strong interaction can occur, yielding significant forces and adding energy to the system. It is also nonphysical for particles to overlap, as in the real world their finite volume prevents this. To alleviate these issues, a scheme was introduced that places particles at randomly chosen discrete lattice points in lieu of continuous random positions. A visual overview of the placement is shown in Figure \ref{fig:box-mesh}: + +\begin{figure}[H] + \centering + \resizebox{0.45\textwidth}{!}{\input{Figures/mesh.pdf_tex}} + \caption{A visual example of stochastic particle initialization using a fine mesh structure. A lattice is defined using the outer box geometry, and lattice points are picked at random for particle placement. Each particle is placed on the discrete grid points such that particle-particle overlap is mitigated. 
A border region (10\%) is excluded around the domain boundary.} + \label{fig:box-mesh} +\end{figure} + +While this technique removes a certain degree of `randomness,' it does not bias the sampling unfairly. The positions are still randomly sampled; they are simply snapped to discrete points using a ceiling function. The ceiling function was chosen specifically to avoid zero-indexing mesh positions in Fortran. The algorithm for placing points is presented in Algorithm \ref{alg:mesh-pts}. + +\begin{algorithm}[H] +\caption{Initial Particle Positions Algorithm} +\label{alg:mesh-pts} +\begin{algorithmic}[1] +\State $n_{mesh} = L/r_c$ \Comment{Define the number of mesh points in each direction} +\State $d=0.9\times L/n_{mesh}$ \Comment{And the spacing between mesh points} +\While{$k < N$} \Comment{Loop over all particles in the system} + \State $l_{x,y} \gets \mathrm{PNR} \in (0,1) $ \Comment{Generate 2 pseudorandom numbers} + \State $i_{x,y} = \mathrm{ceiling}(l_{x,y}\times n_{mesh})$ \Comment{Transform the random numbers to a lattice position} + \If{$\mathrm{empty}(i_x,i_y)$} \Comment{If the position does not have a particle, place one there} + \State $p_{x,y} = (i_{x,y}\times d) - L/2$ \Comment{Assign the $k$-th particle this position} + \State $k = k + 1$ \Comment{Move on to the next particle} + \EndIf +\EndWhile +\end{algorithmic} +\end{algorithm} + +Particle velocities were sampled from a Gaussian (normal) distribution. The Box--Muller transform was applied to generate pairs of normally-distributed numbers from uniformly-distributed numbers, which were then used to prescribe an initial $x$- and $y$-velocity to each particle. +%\note{algorithm 2 for initial particles positions for the particle initialization is showing in the 2.1.2 particle motion section. 
\subsection{Particle Motion}
Each particle experiences three forces at every time step: viscous drag, a stochastic thermal kick, and WCA repulsion from neighbouring particles. With $m = 1$ throughout, force and acceleration are interchangeable.

The drag force is evaluated at the half-step velocity, consistent with the Verlet update order. Initial velocities are drawn from the Maxwell distribution using the Box--Muller transform, giving each component a Gaussian distribution with zero mean and variance proportional to the system temperature, consistent with thermal equilibrium. Positions and velocities are integrated using the velocity-Verlet scheme:

\begin{algorithm}[H]
\caption{Velocity Verlet Integration Loop}
\begin{algorithmic}[1]
\State Initialise particle positions, velocities, accelerations
\While{$t < t_{\max}$}
    \For{$i = 1$ to $N$}
        \State $v_{hx} \gets v_x + \frac{1}{2} a_x \, dt$ \Comment{half-step velocity}
        \State $x \gets x + v_{hx} \, dt$ \Comment{update position}
        \State Impose boundary conditions
    \EndFor
    \For{$i = 1$ to $N$}
        \State Compute forces and accelerations at the new position
        \State $v_x \gets v_{hx} + \frac{1}{2} a_x \, dt$ \Comment{full-step velocity}
        \State Record data
    \EndFor
    \State Compute ensemble averages
    \State $t \gets t + dt$
\EndWhile
\end{algorithmic}
\end{algorithm}

The order of operations is important: positions must be updated before forces are computed, and half-step velocities must be used consistently throughout each step.

\subsection{Boundary Conditions}

The simulation domain is a square box $[-L/2,\,L/2]^2$ with reflective walls: particles bounce off the boundaries rather than wrapping around (which would be periodic boundary conditions).
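The contrast between reflective and periodic walls can be made concrete with a short sketch. This is illustrative Python, not the report's Fortran \texttt{impose\_BC} routine; \texttt{half\_l} denotes $L/2$ and the function names are assumptions:

```python
def reflect(y, vy, half_l):
    """Reflective wall: fold the position back across the wall and reverse the
    normal velocity component (specular reflection)."""
    if y > half_l:
        y = 2.0 * half_l - y      # for a wall at L/2 this is L - y
        vy = -vy
    elif y < -half_l:
        y = -2.0 * half_l - y
        vy = -vy
    return y, vy

def wrap(y, half_l):
    """Periodic wall, shown for contrast: the particle re-enters from the
    opposite side with its velocity unchanged."""
    box = 2.0 * half_l
    return (y + half_l) % box - half_l
```

A particle at $y = 0.6$ in a box with $L = 1$ is reflected to $y = 0.4$ with reversed velocity, whereas the periodic rule would instead place it at $y = -0.4$.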
The reflection rule for, e.g., the top wall is:

\begin{algorithm}[H]
\caption{Reflective boundary condition with consistency check}
\label{alg:bc}
\begin{algorithmic}[1]
\If{$y(i) > L/2$}
    \State $y(i) \gets L - y(i)$ \Comment{reflect position}
    \State $vhy(i) \gets -vhy(i)$ \Comment{reverse normal velocity}
\EndIf
\end{algorithmic}
\end{algorithm}

The same logic applies to all four walls. An \texttt{outside} flag is set if a particle cannot be reflected back inside (e.g.\ after an extreme overshoot), in which case it is excluded from further updates. After every call to \texttt{impose\_BC(i)}, a consistency check verifies that the particle is either inside the box or flagged as outside; if this check never triggers, the BC logic is internally consistent.

Three tests were used to validate the boundary condition implementation.

\paragraph{Test 1: Logical consistency check.}
The consistency check embedded in Algorithm \ref{alg:bc} was run throughout all simulations. No BC errors were triggered in any test case.

\paragraph{Test 2: Individual trajectories.}
Trajectories of particles 1, 2, and 3 were logged to separate files and plotted to visually confirm smooth, physically reasonable motion with correct reflections at the walls.

\begin{figure}[H]
    \centering
    \subcaptionbox{Working principle of a specular reflective boundary
        condition. Red: incident particle penetrates the boundary. Magenta:
        position corrected as if bounced.
Blue: particle continues in the next
        timestep.\label{fig:reflective-BC}}[0.50\textwidth]
    {\resizebox{0.45\textwidth}{!}{\input{Figures/reflectiveBC.pdf_tex}}}
    \hfill
    \subcaptionbox{Trajectories of particles 1 (red), 2 (green), and 3 (blue) showing correct
        reflections at the boundary walls.\label{fig:trajectories}}[0.45\textwidth]
    {\def\svgwidth{0.40\textwidth}\input{Figures/trajectories.tex}}
    \caption{Boundary condition behaviour.}
    \label{fig:BC-combined}
\end{figure}

\paragraph{Test 3: Stress test with large time step.}
Setting $dt = 1.0$ forces large overshoots to test whether \texttt{impose\_BC} handles extreme cases. No failure case was triggered, which is consistent with correct behaviour but not conclusive on its own: a particle can overshoot by more than $L$ in a single step, and a single reflection may not be sufficient to bring it back inside.

\subsection{Particle Interactions}

Adding pairwise interactions increases the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(N^2)$, since each particle must be compared with every other particle at each time step. This pairwise force evaluation constitutes the dominant computational cost of the simulation.

To ensure particles actually interact during testing, the particle density was increased so that particles lie close enough to fall within $r_c$ of each other. Without this, the interaction code would never be exercised.
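The all-pairs evaluation described above can be sketched as follows. This is an illustrative Python sketch of the pairwise WCA force, not the report's Fortran code; the parameter values $\epsilon = \sigma = 1$ and the function name are assumptions. The WCA cutoff is $r_c = 2^{1/6}\sigma$, making the force purely repulsive:

```python
import math

def wca_forces(positions, epsilon=1.0, sigma=1.0):
    """Naive O(N^2) pairwise WCA forces in 2D; cutoff r_c = 2^(1/6) * sigma."""
    r_c = 2.0 ** (1.0 / 6.0) * sigma
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):              # visit each pair once
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            r = math.hypot(dx, dy)
            if 0.0 < r < r_c:
                sr6 = (sigma / r) ** 6
                # F(r)/r with F(r) = -dV/dr = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6)/r
                f_over_r = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / (r * r)
                forces[i][0] += f_over_r * dx
                forces[i][1] += f_over_r * dy
                forces[j][0] -= f_over_r * dx   # Newton's third law
                forces[j][1] -= f_over_r * dy
    return forces
```

Note that looping over $j > i$ and applying Newton's third law halves the work relative to a loop over all $j \neq i$; the $\mathcal{O}(N^2)$ scaling is unchanged.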
The interaction loop follows this structure:

\begin{algorithm}[H]
\caption{Force calculation with Langevin dynamics}
\begin{algorithmic}[1]
\For{$i = 1$ to $N$}
    \State Apply drag and random (Langevin) forces to particle $i$
    \ForAll{$j \neq i$}
        \State $d_{ij} \gets \lVert x_i - x_j \rVert$
        \If{$d_{ij} < r_c$}
            \State $F \gets -\dfrac{d}{dr}V(r)\Big|_{r=d_{ij}}$
            \State Apply $F$ to particle $i$ along $r_{ij}$
        \EndIf
    \EndFor
\EndFor
\end{algorithmic}
\end{algorithm}

An output file was added to log whenever an interaction was detected, recording the positions of the interacting pair. This allowed direct verification that interactions occurred at the expected locations and were not spuriously triggered.

\paragraph{Energy conservation check.}
Kinetic energy vs.\ time was plotted to verify that the thermostat maintains a stable average energy. An initial version showed that energy was not conserved; this was diagnosed and fixed.
\begin{figure}[H]
    \centering
    %\includegraphics[width=0.62\linewidth]{Figures/walltime_vs_N.pdf}
    \input{Figures/kinetic_energy.tex}
    \caption{Total system kinetic energy vs.\ time, simulated with $N=1000$ particles and, in one case, no stochastic/drag forces, normalized to start at equal energies. Note the dips in kinetic energy while interactions take place and kinetic energy is transferred to potential energy through particle-particle repulsion; the kinetic energy returns to its nominal value after the interactions conclude. This behaviour is not seen in the run with stochastic forces enabled.}
    \label{fig:kinetic-energy}
\end{figure}
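The kinetic-energy check above can be mimicked in a few lines: with $m = 1$ and Gaussian velocity components of variance $T$, equipartition gives an average kinetic energy of $T$ per particle in 2D. This is a hedged Python sketch with illustrative values, not the report's diagnostic code:

```python
import random

def total_kinetic_energy(velocities):
    """KE = (1/2) * sum(vx^2 + vy^2), with unit particle mass."""
    return 0.5 * sum(vx * vx + vy * vy for vx, vy in velocities)

def sample_ke_per_particle(n, temperature, seed=1):
    """Draw Gaussian velocities of variance T and return the mean KE per particle."""
    rng = random.Random(seed)
    sigma = temperature ** 0.5
    vels = [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]
    return total_kinetic_energy(vels) / n
```

For a large ensemble the mean kinetic energy per particle should fluctuate around $T$, which is the "stable average energy" the thermostat check monitors.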
\section{Performance}
\subsection{Simulation Complexity}

\begin{figure}[H]
    \centering
    %\includegraphics[width=0.90\linewidth]{Figures/walltime_vs_N.pdf}
    \input{Figures/timing.tex}
    \caption{Tracked wall times vs.\ number of particles $N$.}
    \label{fig:Newwalltime_vs_N}
\end{figure}

Wall time grows quadratically with the number of particles $N$, consistent with the expected $\mathcal{O}(N^2)$ complexity of the pairwise force evaluation.

\subsection{Module Performance}

To identify computational bottlenecks, wall time was profiled by subroutine. The dominant cost is \textbf{pairwise interactions}: the double loop over all $i$--$j$ pairs evaluates $N(N-1)/2$ distances per time step, with each iteration requiring a square root and, when within $r_c$, several power-law evaluations for the WCA force. \textbf{Disk I/O} is a significant secondary cost, as positions and velocities are written for every particle at every time step; reducing the output frequency is a straightforward way to recover performance. \textbf{Random number generation} accumulates measurable cost over many time steps due to two length-$N$ array draws per step. By contrast, \textbf{boundary condition} checks and \textbf{velocity--Verlet} integration are both $\mathcal{O}(N)$ and contribute minimally to total runtime. These bottlenecks motivate the optimizations described in Chapter~\ref{ch:Optimization}.
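The quadratic wall-time growth follows directly from the $N(N-1)/2$ pair count quoted above: doubling $N$ roughly quadruples the number of distance evaluations per time step. A trivial Python sketch makes this explicit (illustrative only):

```python
def pair_evaluations(n):
    """Distance evaluations per time step in the naive double loop over pairs."""
    return n * (n - 1) // 2

# Doubling N approaches a 4x increase in work for large N.
for n in (1000, 2000, 4000):
    print(n, pair_evaluations(n))
```

This is the baseline the domain-decomposition optimizations improve on, by restricting the inner loop to particles in neighbouring sectors.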
