From 29869b47ef6e3bfe461548694e49ed63c5533020 Mon Sep 17 00:00:00 2001
From: jsh77
Date: Thu, 13 May 2021 16:42:19 +0000
Subject: [PATCH] Update on Overleaf.

---
 3_Implementation/implementation.tex | 38 +++++++++++++++++------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/3_Implementation/implementation.tex b/3_Implementation/implementation.tex
index 8238d9b..925f9f3 100644
--- a/3_Implementation/implementation.tex
+++ b/3_Implementation/implementation.tex
@@ -12,13 +12,13 @@ % --------------------------- Introduction --------------------------------- %

-Implementation of the proxy is in two parts: software that provides a multipath layer 3 tunnel between two hosts, and the system configuration necessary to utilise this tunnel as a proxy. An overview of the software and system is presented in figure \ref{fig:dataflow-overview}.
+Implementation of the proxy is in two parts: software that provides a multipath layer 3 tunnel between two hosts, and the system configuration necessary to utilise this tunnel as a proxy. An overview of the software and system is presented in Figure \ref{fig:dataflow-overview}.

 This chapter will detail this implementation in three sections. The software will be described in sections \ref{section:implementation-packet-transport} and \ref{section:implementation-software-structure}. Section \ref{section:implementation-packet-transport} details the implementation of both TCP and UDP methods of transporting the tunnelled packets between the hosts. Section \ref{section:implementation-software-structure} explains the software's structure and dataflow. The system configuration will be described in section \ref{section:implementation-system-configuration}, along with a discussion of some of the oddities of multipath routing, such that a reader would have enough knowledge to implement the proxy given the software. Figure \ref{fig:dataflow-overview} shows the path of packets within the proxy. As each section discusses an element of the program, where it fits within this diagram is detailed.

 \begin{sidewaysfigure}
 \includegraphics[width=\textheight]{overview.png}
-\caption{Diagram of packet flow within the proxy.}
+\caption{Diagram of packet path from a client behind the proxy to a server on the Internet.}
 \label{fig:dataflow-overview}
 \end{sidewaysfigure}

@@ -28,7 +28,7 @@ This chapter will detail this implementation in three sections. The software wil
 \section{Packet Transport}
 \label{section:implementation-packet-transport}

-As shown in figure \ref{fig:dataflow-overview}, the interfaces through which transport for packets is provided between the two hosts are producers and consumers. A transport pair is then between a consumer on one proxy and a producer on the other, where packets enter the consumer and exit the corresponding producer. Two methods for producers and consumers are implemented: TCP and UDP. As the greedy load balancing of this proxy relies on congestion control, TCP provided a base for a proof of concept, while UDP expands on this proof of concept to produce a remove unnecessary overhead and improve performance in the case of TCP-over-TCP tunnelling. This section discusses, in section \ref{section:implementation-tcp}, the method of transporting discrete packets across the continuous byte stream of a TCP flow, before describing why this solution is not ideal. Then, in section \ref{section:implementation-udp}, it goes on to discuss adding congestion control to UDP datagrams, while avoiding ever retransmitting a proxied packet.
+As shown in Figure \ref{fig:dataflow-overview}, the interfaces through which transport for packets is provided between the two hosts are producers and consumers. A transport pair is then between a consumer on one proxy and a producer on the other, where packets enter the consumer and exit the corresponding producer. Two methods for producers and consumers are implemented: TCP and UDP. As the greedy load balancing of this proxy relies on congestion control, TCP provided a base for a proof of concept, while UDP expands on this proof of concept to remove unnecessary overhead and improve performance in the case of TCP-over-TCP tunnelling. This section discusses, in section \ref{section:implementation-tcp}, the method of transporting discrete packets across the continuous byte stream of a TCP flow, before describing why this solution is not ideal. Then, in section \ref{section:implementation-udp}, it goes on to discuss adding congestion control to UDP datagrams, while avoiding ever retransmitting a proxied packet.
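+
+Both transport methods are hidden behind this consumer and producer pair, so the rest of the proxy is agnostic to how packets actually travel between the hosts. As a rough illustration, the interfaces might take a shape similar to the Go sketch below; the names and signatures here are illustrative assumptions rather than the implementation's exact definitions.
+
+\begin{minted}{go}
+// Illustrative sketch only: the real interface definitions in the
+// implementation may differ in names and signatures.
+
+// A Consumer accepts packets sourced on this host and carries them towards
+// the paired Producer on the far side.
+type Consumer interface {
+    // Consume transports one whole proxied packet.
+    Consume(packet []byte) error
+}
+
+// A Producer yields packets that were consumed by the far side.
+type Producer interface {
+    // Produce blocks until the next whole packet arrives from the far side.
+    Produce() ([]byte, error)
+}
+\end{minted}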

 \subsection{TCP}
 \label{section:implementation-tcp}
@@ -47,11 +47,13 @@ Although the TCP implementation provides an excellent proof of concept and basic
 After initial success with the TCP proof-of-concept, work moved to developing a UDP protocol for transporting the proxied packets. UDP differs from TCP in providing a more basic mechanism for sending discrete messages, while TCP provides a stream of bytes. Implementing a UDP datagram proxy solution returns control from the kernel to the application itself, allowing much more fine-grained management of congestion control. Further, UDP provides increased performance over TCP by removing ordering guarantees, and improving the quality of TCP tunnelling compared to TCP-over-TCP. This allows maximum flexibility, as application developers should not have to avoid using TCP to maintain compatibility with my proxy.

-This section first describes the special purpose congestion control mechanism designed, which uses negative acknowledgements to avoid retransmissions. This design informs the design of the UDP packet structure. Finally, this section discusses the initial implementation of congestion control, which is based on the characteristic curve of TCP New Reno.
+This section first describes the special purpose congestion control mechanism designed, which uses negative acknowledgements to avoid retransmissions. This mechanism informs the design of the UDP packet structure. Finally, this section discusses the initial implementation of congestion control, which is based on the characteristic curve of TCP New Reno \citep{henderson_newreno_2012}.

 \subsection{Congestion Control}

-Congestion control is most commonly applied in the context of reliable delivery. This provides a significant benefit to TCP congestion control protocols: cumulative acknowledgements. As all of the bytes should always arrive eventually, unless the connection has faulted, the acknowledgement number (ACK) can simply be set to the highest received byte. Therefore, some adaptations are necessary for TCP congestion control algorithms to apply in an unreliable context. Firstly, for a packet based connection, ACKing specific bytes makes little sense - a packet is atomic, and is lost as a whole unit. To account for this, sequence numbers and their respective acknowledgements will be for entire packets, as opposed to per byte. Secondly, for an unreliable protocol, cumulative acknowledgements are not as simple. As packets are now allowed to never arrive within the correct function of the flow, a situation where a packet is never received would cause deadlock with an ACK that is simply set to the highest received sequence number, demonstrated in figure \ref{fig:sequence-ack-discontinuous}. Neither side can progress once the window is full, as the sender will not receive an ACK to free up space within the window, and the receiver will not receive the missing packet to increase the ACK.
+Congestion control is most commonly applied in the context of reliable delivery. This provides a significant benefit to TCP congestion control protocols: cumulative acknowledgements. As all of the bytes should always arrive eventually, unless the connection has faulted, the acknowledgement number (ACK) can simply be set to the highest received byte. Therefore, some adaptations are necessary for such a congestion control algorithm to apply in a context where reliable delivery is not expected. Firstly, for a packet-based connection, ACKing specific bytes makes little sense - a packet is atomic, and is lost as a whole unit. To account for this, sequence numbers and their respective acknowledgements will be for entire packets, as opposed to per byte.
+
+Secondly, for a protocol that does not guarantee reliable delivery, cumulative acknowledgements are not as simple. As tunnelled packets are now allowed to never arrive, even within the correct function of the flow, a situation where a packet is never received would cause deadlock with an ACK that is simply set to the highest received sequence number, demonstrated in Figure \ref{fig:sequence-ack-discontinuous}. Neither side can progress once the window is full, as the sender will not receive an ACK to free up space within the window, and the receiver will not receive the missing packet to increase the ACK. In TCP, one would expect the missing packet (one above the received ACK) to be retransmitted, which allows the ACK to catch up in only one RTT. However, as retransmissions are to be avoided, the UDP solution presented here would become deadlocked - the sending side knows that the far side has not received the packet, but must not retransmit.

 \begin{figure}
 \hfill
@@ -105,9 +107,9 @@
 \hfill
 \end{figure}

-I present a solution based on Negative Acknowledgements (NACKs). When the receiver believes that it will never receive a packet, it increases the NACK to the highest missing sequence number, and sets the ACK to one above the NACK. The ACK algorithm is then performed to grow the ACK as high as possible. This is simplified to any change in NACK representing at least one lost packet, which can be used by the specific congestion control algorithms to react. Though this usage of the NACK appears to provide a close approximation to ACKs on reliable delivery, the choice of how to use the ACK and NACK fields is delegated to the congestion controller implementation, allowing for different implementations if they better suit the method of congestion control.
+I present a solution based on Negative Acknowledgements (NACKs). When the receiver believes that it will never receive a packet, it increases the NACK to the highest missing sequence number, and sets the ACK to one above the NACK. This occurs after a timeout that is presently set at $3*RTT$ (Round Trip Time). The ACK algorithm is then performed to grow the ACK as high as possible. This is simplified such that any change in the NACK represents at least one lost packet, to which the specific congestion control algorithm can react. Though this usage of the NACK appears to provide a close approximation to ACKs on reliable delivery, the choice of how to use the ACK and NACK fields is delegated to the congestion controller implementation, allowing for different implementations if they better suit the method of congestion control. Using NACKs, the deadlock in Figure \ref{fig:sequence-ack-nack-discontinuous} can be avoided, with the case in Figure \ref{fig:sequence-ack-nack-comparison} occurring instead. The NACK is used to inform the far side that a packet was lost, and therefore allows it to continue sending fresh packets. In contrast, TCP would retransmit the missing packet, which can be avoided with this NACK-based solution.
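+
+To make the receiver's side of this mechanism concrete, the sketch below shows one possible shape of the bookkeeping described above: the ACK grows over contiguously received packets, and once the receiver gives up on a missing packet (after the timeout) the NACK jumps to that sequence number and the ACK moves past it. This is a reconstruction for illustration only; the names, types and timeout handling are assumptions rather than the implementation's code.
+
+\begin{minted}{go}
+// Illustrative sketch of receiver-side ACK/NACK bookkeeping; not the
+// implementation's own types or names.
+type receiverState struct {
+    ack      uint32          // everything up to and including ack has arrived or been given up on
+    nack     uint32          // highest sequence number that has been given up on
+    received map[uint32]bool // packets seen above ack, still discontiguous
+}
+
+func newReceiverState() *receiverState {
+    return &receiverState{received: make(map[uint32]bool)}
+}
+
+// onPacket records an arriving sequence number and then grows the ACK over
+// any now-contiguous packets.
+func (r *receiverState) onPacket(seq uint32) {
+    r.received[seq] = true
+    for r.received[r.ack+1] {
+        r.ack++
+        delete(r.received, r.ack)
+    }
+}
+
+// onGiveUp is called once a missing packet has timed out as described above:
+// the NACK is raised to the missing sequence number and the ACK is moved to
+// one above it, then grown as far as possible, as in onPacket.
+func (r *receiverState) onGiveUp(missing uint32) {
+    if missing <= r.nack {
+        return
+    }
+    r.nack = missing
+    r.ack = missing
+    for r.received[r.ack+1] {
+        r.ack++
+        delete(r.received, r.ack)
+    }
+}
+\end{minted}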

-Given the decision to use ACKs and NACKs, the packet structure for UDP datagrams can now be designed. The chosen structure is given in figure \ref{fig:udp-packet-structure}. The congestion control header consists of the sequence number and the ACK and NACK, each 32-bit unsigned integers.
+Given the decision to use ACKs and NACKs, the packet structure for UDP datagrams can now be designed. The chosen structure is given in Figure \ref{fig:udp-packet-structure}. The congestion control header consists of the sequence number, the ACK and the NACK, each a 32-bit unsigned integer.

 \begin{figure}
 \centering
@@ -136,9 +138,13 @@ Given the decision to use ACKs and NACKs, the packet structure for UDP datagrams
 \subsubsection{New Reno}

-The first algorithm to be implemented for UDP Congestion Control is based on TCP New Reno. TCP New Reno is a well understood and powerful congestion control protocol. RTT estimation is performed by applying $RTT_{AVG} = RTT_{AVG}*(1-x) + RTT_{SAMPLE}*x$ for each newly received packet. Packet loss is measured in two ways: negative acknowledgements when a receiver receives a later packet than expected and has not received the preceding for $0.5*RTT$, and a sender timeout of $3*RTT$. The sender timeout exists to ensure that even if the only packet containing a NACK is dropped, the sender does not deadlock, though this case should be rare with a busy connection.
+TCP New Reno \citep{henderson_newreno_2012} is widely known for its sawtooth pattern of throughput. New Reno is an RTT-based congestion control mechanism which, in the steady state, increases the window size (the number of packets that can be in flight at one time) by one for each window transmitted successfully. In the case of a retransmission, this quantity halves. The window size needed to sustain a given throughput depends on the round trip time, as a longer round trip time requires a larger window to keep the same number of packets in flight. For a freshly started New Reno connection, slow start occurs, which increases the window size by one for each packet transmitted successfully, as opposed to each full window of packets. This creates an exponential growth curve, which stops on the first transmission failure.
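+
+This window behaviour can be sketched as follows: growth of one per acknowledged packet during slow start, one per full window during congestion avoidance, and halving on loss. The sketch is purely illustrative; the controller implemented here (described below) additionally estimates the RTT, tracks in-flight packets and detects loss via NACKs, and its structure and names differ.
+
+\begin{minted}{go}
+// Illustrative sketch of a New Reno style window update; names and structure
+// are assumptions, not the implementation's own.
+type newReno struct {
+    windowSize   uint32 // packets allowed in flight at once
+    slowStart    bool   // true during the initial exponential growth phase
+    ackedInRound uint32 // packets acknowledged towards the current window
+}
+
+// onAck is called for each packet acknowledged without loss.
+func (c *newReno) onAck() {
+    if c.slowStart {
+        // Slow start: grow by one per acknowledged packet.
+        c.windowSize++
+        return
+    }
+    // Congestion avoidance: grow by one per full window acknowledged.
+    c.ackedInRound++
+    if c.ackedInRound >= c.windowSize {
+        c.windowSize++
+        c.ackedInRound = 0
+    }
+}
+
+// onLoss is called when a NACK (or a sender timeout) indicates a lost packet.
+func (c *newReno) onLoss() {
+    c.slowStart = false
+    if c.windowSize > 1 {
+        c.windowSize /= 2
+    }
+    c.ackedInRound = 0
+}
+\end{minted}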

-To achieve the same curve as New Reno, there are two phases: exponential growth and congestion avoidance. On flow start, using a technique known as slow start, for every packet that is acknowledged, the window size is increased by one. When a packet loss is detected (using either of the two aforementioned methods), slow start ends, and the window size is halved. Now in congestion avoidance, the window size is increased by one for every full window of packets acknowledged without loss, instead of each individual packet. When a packet loss is detected, the window size is half, and congestion avoidance continues.
+The algorithm implemented here performs similarly, and behaves identically to New Reno over a flawless connection: if no packets are lost, the window evolves in the same way. This includes increasing the window size by one for each successfully transmitted packet initially, before dropping to an increase of one for each full window later in the flow. The difference from TCP's mechanisms arises when packets are lost, and more specifically in how that loss is detected. This is the NACK mechanism, which sets the NACK to the missing sequence number once a packet has been waiting for more than $0.5*RTT$ to be acknowledged. For example, if packet 4 arrives before packet 3, and packet 3 has still not arrived after an additional half of the round trip time (the entire time expected for the packet to arrive), the NACK field on the next packet returned is set to 3, with the ACK field set to 4. When the sender receives this NACK, it adjusts the window size as TCP would on a loss: halving it, and ending slow start.
+
+The congestion control algorithm is accessed by multiple threads at any one time, so it uses a mixture of atomic operations and fine-grained locking to remain consistent. The \texttt{ack}, \texttt{nack} and \texttt{windowSize} fields all use atomic operations, such that they can be read immediately, allowing most of the work of sending a packet to proceed without taking a lock. However, the \texttt{inFlight} and \texttt{awaitingAck} fields are each protected by a mutex, ensuring that they remain consistent. This is a compromise between performance and simplicity, limiting code complexity while allowing more concurrency than a single coarse-grained lock. Further, high-level data structures (specifically, growable lists) are used, which reduce programming complexity at the cost of some performance. This allows for good readability, and increases the likelihood of writing correct code.
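+
+A sketch of this locking pattern is given below. The field names \texttt{ack}, \texttt{nack}, \texttt{windowSize}, \texttt{inFlight} and \texttt{awaitingAck} follow the text above, but the surrounding structure and methods are illustrative assumptions rather than the implementation's code.
+
+\begin{minted}{go}
+package congestion // hypothetical package name for this sketch
+
+import (
+    "sync"
+    "sync/atomic"
+)
+
+// Illustrative sketch of mixing atomics with a mutex in the congestion
+// controller; not the implementation's own structure.
+type controller struct {
+    ack        uint32 // accessed only through sync/atomic
+    nack       uint32 // accessed only through sync/atomic
+    windowSize uint32 // accessed only through sync/atomic
+
+    mu          sync.Mutex // guards the growable lists below
+    inFlight    []uint32   // sequence numbers sent but not yet acknowledged
+    awaitingAck []uint32   // received packets still to be acknowledged
+}
+
+// windowSpace reads the window size without blocking other threads.
+func (c *controller) windowSpace() uint32 {
+    return atomic.LoadUint32(&c.windowSize)
+}
+
+// recordSent notes a packet as in flight; only the list needs the lock.
+func (c *controller) recordSent(seq uint32) {
+    c.mu.Lock()
+    c.inFlight = append(c.inFlight, seq)
+    c.mu.Unlock()
+}
+
+// onAck publishes the new ACK atomically, then prunes acknowledged packets
+// from the in-flight list under the lock.
+func (c *controller) onAck(ack uint32) {
+    atomic.StoreUint32(&c.ack, ack)
+    c.mu.Lock()
+    kept := c.inFlight[:0]
+    for _, seq := range c.inFlight {
+        if seq > ack {
+            kept = append(kept, seq)
+        }
+    }
+    c.inFlight = kept
+    c.mu.Unlock()
+}
+\end{minted}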
+
+Congestion control is one of the main focuses of testing in the repository. The New Reno controller was developed mostly with test-driven development, due to the complicated interactions between threads. Though the testing of multithreaded code can be extremely challenging due to the risk of deadlock when the code is incorrect, large timeouts and a CI environment made this quite manageable.

 % -------------------------------------------------------------------------- %
 % ------------------------- Software Structure ----------------------------- %
@@ -152,13 +158,13 @@ This section details the design decisions behind the application structure, and
 \subsection{Proxy}
 \label{section:implementation-proxy}

-The central structure for the operation of the software is the \verb'Proxy' struct. The proxy is defined by its source and sink, and provides methods for \verb'AddConsumer' and \verb'AddProducer'. The proxy coordinates the dispatching of sourced packets to consumers, and the delivery of produced packets to the sink. This follows the packet data path shown in figure \ref{fig:dataflow-overview}.
+The central structure for the operation of the software is the \verb'Proxy' struct. The proxy is defined by its source and sink, and provides methods for \verb'AddConsumer' and \verb'AddProducer'. The proxy coordinates the dispatching of sourced packets to consumers, and the delivery of produced packets to the sink. This follows the packet data path shown in Figure \ref{fig:dataflow-overview}.

 The proxy is implemented to take a consistent sink and source and accept consumers and producers that vary over the lifetime. This is due to the nature of producers and consumers, as each may be either ephemeral or persistent, depending on the configuration. An example is a device that accepts TCP connections and makes outbound UDP connections. In such a case, the TCP producers and consumers would be ephemeral, existing only until they are closed by the far side. The UDP producers and consumers are persistent, as control of reconnection is handled by this proxy. As the configuration is deliberately intended to be flexible, both of these can exist within the same proxy instance.

-The structure of the proxy is built around the flow graph in figure \ref{fig:dataflow-overview}. The packet flow demonstrates the four transfers of data that occur within the software: packet source (TUN adapter) to source queue, source queue to consumer, producer to sink queue, and sink queue to packet sink (TUN adapter). For the former and latter, these exist once for an instance of the proxy. The others run once for each consumer or producer. The lifetime of producers and consumers are controlled by the lifetime of these data flow loops and are only referenced within them, such that the garbage collector can collect any producers and consumers for which the loops have exited.
+The structure of the proxy is built around the flow graph in Figure \ref{fig:dataflow-overview}. The packet flow demonstrates the four transfers of data that occur within the software: packet source (TUN adapter) to source queue, source queue to consumer, producer to sink queue, and sink queue to packet sink (TUN adapter). The first and last of these exist once for an instance of the proxy; the others run once for each consumer or producer. The lifetimes of producers and consumers are controlled by the lifetimes of these data flow loops, and producers and consumers are only referenced within them, such that the garbage collector can collect any producer or consumer whose loop has exited.
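+
+A rough sketch of how this structure and its data flow loops might look is given below. The \verb'Proxy', \verb'AddConsumer' and \verb'AddProducer' names come from the text above, but the field types, the \verb'PacketSource' and \verb'PacketSink' interfaces, and the loop bodies are illustrative assumptions rather than the implementation's definitions.
+
+\begin{minted}{go}
+// Illustrative sketch of the Proxy structure and its data flow loops.
+type PacketSource interface{ Read() ([]byte, error) }   // e.g. the TUN adapter
+type PacketSink interface{ Write(packet []byte) error } // e.g. the TUN adapter
+
+type Proxy struct {
+    source      PacketSource
+    sink        PacketSink
+    sourceQueue chan []byte // packets read from the source, awaiting a consumer
+    sinkQueue   chan []byte // packets produced by the far side, awaiting the sink
+}
+
+// AddConsumer starts a loop carrying packets from the source queue into the
+// given consumer. The consumer is referenced only by this loop, so it can be
+// garbage collected once the loop exits.
+func (p *Proxy) AddConsumer(c Consumer) {
+    go func() {
+        for packet := range p.sourceQueue {
+            if err := c.Consume(packet); err != nil {
+                return
+            }
+        }
+    }()
+}
+
+// AddProducer starts a loop delivering packets from the given producer to the
+// sink queue.
+func (p *Proxy) AddProducer(pr Producer) {
+    go func() {
+        for {
+            packet, err := pr.Produce()
+            if err != nil {
+                return
+            }
+            p.sinkQueue <- packet
+        }
+    }()
+}
+\end{minted}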

-Finally is the aforementioned ability for the central proxy to restart consumers or producers that support it (those initiated by the proxy in question). Pseudocode for a consumer is shown in figure \ref{fig:proxy-loops-restart}. Whenever a producer or consumer terminates, and is found to be restartable, the application attempts to restart it until succeeding and re-entering the work loop.
+Finally, there is the aforementioned ability for the central proxy to restart consumers or producers that support it (those initiated by the proxy in question). Pseudocode for a consumer is shown in Figure \ref{fig:proxy-loops-restart}. Whenever a producer or consumer terminates and is found to be restartable, the application attempts to restart it until it succeeds and re-enters the work loop.

 \begin{figure}
 \begin{minted}{python}
@@ -231,7 +237,7 @@ The provided implementation for message authenticity uses the BLAKE2s \citep{hut
 \subsubsection{Repeat Protection}

-Repeat protection takes advantage of the same two interfaces already mentioned. To allow this to be implemented, each consumer or producer takes an ordered list of \verb'MacGenerator's or \verb'MacVerifier's. When a packet is consumed, each of the generators is run in order, operating on the data of the last. When produced, this operation is completed in reverse, with each \verb'MacVerifier' stripping off the corresponding generator. An example of this is shown in figure \ref{fig:udp-packet-dataflow}. Firstly, the data sequence number is generated, before the MAC. When receiving the packet, the MAC is first stripped, before the data sequence number.
+Repeat protection takes advantage of the same two interfaces already mentioned. To allow this to be implemented, each consumer or producer takes an ordered list of \verb'MacGenerator's or \verb'MacVerifier's. When a packet is consumed, each of the generators is run in order, each operating on the output of the last. When a packet is produced, this operation is completed in reverse, with each \verb'MacVerifier' checking and stripping off the output of the corresponding generator. An example of this is shown in Figure \ref{fig:udp-packet-dataflow}. Firstly, the data sequence number is generated, before the MAC. When receiving the packet, the MAC is first stripped, before the data sequence number.
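+
+A sketch of how these layered generators and verifiers compose is shown below. The \verb'MacGenerator' and \verb'MacVerifier' names come from the text above, but the method names and signatures are assumptions made for illustration.
+
+\begin{minted}{go}
+// Illustrative sketch of layering MAC generation and verification; the
+// method signatures are assumptions, not the implementation's interfaces.
+type MacGenerator interface {
+    // Add appends this layer's code (e.g. a data sequence number or a MAC)
+    // to the packet and returns the extended packet.
+    Add(packet []byte) []byte
+}
+
+type MacVerifier interface {
+    // Strip checks this layer's code and removes it, returning the inner
+    // packet or an error if verification fails.
+    Strip(packet []byte) ([]byte, error)
+}
+
+// seal is applied when a packet is consumed: each generator runs in order,
+// operating on the output of the previous one (sequence number, then MAC).
+func seal(packet []byte, generators []MacGenerator) []byte {
+    for _, g := range generators {
+        packet = g.Add(packet)
+    }
+    return packet
+}
+
+// open is applied when a packet is produced: the layers are removed in
+// reverse order (MAC first, then the data sequence number).
+func open(packet []byte, verifiers []MacVerifier) ([]byte, error) {
+    for i := len(verifiers) - 1; i >= 0; i-- {
+        inner, err := verifiers[i].Strip(packet)
+        if err != nil {
+            return nil, err
+        }
+        packet = inner
+    }
+    return packet, nil
+}
+\end{minted}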

 One difference with repeat protection is that it is shared between all producers and consumers. This is in contrast to the message authenticity, which is thus far specific to a producer or consumer. The currently implemented repeat protection is that of \cite{tsou_ipsec_2012}. The code sample is provided with a BSD license, so it is compatible with this project, and hence was simply adapted from C to Go. This is created at a host level when building the proxy, and the same instance is shared amongst all producers, so it includes locking for thread safety.

@@ -277,7 +283,7 @@ This demonstrates the flexibility of combining the exchange interface with other
 % ------------------------ Repository Overview ----------------------------- %
 \subsection{Repository Overview}

-A directory tree of the repository is provided in figure \ref{fig:repository-structure}. The top level is split between \verb'code' and \verb'evaluation', where \verb'code' is compiled into the application binary, and \verb'evaluation' is used to verify the performance characteristics and generate graphs.
+A directory tree of the repository is provided in Figure \ref{fig:repository-structure}. The top level is split between \verb'code' and \verb'evaluation', where \verb'code' is compiled into the application binary, and \verb'evaluation' is used to verify the performance characteristics and generate graphs.

 \begin{figure}
 \dirtree{%
@@ -308,7 +314,7 @@ A directory tree of the repository is provided in figure \ref{fig:repository-str
 \section{System Configuration}
 \label{section:implementation-system-configuration}

-The software portion of this proxy is entirely symmetric, as can be seen in figure \ref{fig:dataflow-overview}. However, the system configuration diverges, as each side of the proxy serves a different role. Referring to figure \ref{fig:dataflow-overview}, it can be seen that the kernel routing differs between the two nodes. Throughout, these two sides have been referred to as the local proxy and the remote proxy, with the local in the top left and the remote in the bottom right.
+The software portion of this proxy is entirely symmetric, as can be seen in Figure \ref{fig:dataflow-overview}. However, the system configuration diverges, as each side of the proxy serves a different role. Referring to Figure \ref{fig:dataflow-overview}, it can be seen that the kernel routing differs between the two nodes. Throughout, these two sides have been referred to as the local proxy and the remote proxy, with the local in the top left and the remote in the bottom right.

 As the software portion of this application is implemented in user-space, it has no control over the routing of packets. Instead, a virtual interface is provided, and the kernel is instructed to route relevant packets to/from this interface. In sections \ref{section:implementation-remote-proxy-routing} and \ref{section:implementation-local-proxy-routing}, the configuration for routing the packets for the remote proxy and local proxy respectively is explained. Finally, in section \ref{section:implementation-multi-interface-routing}, some potentially unexpected behaviour of using devices with multiple interfaces is discussed, such that the reader can avoid some of these pitfalls. Throughout this section, examples will be given for both Linux and FreeBSD. Though these examples are provided, they represent only one of many methods of achieving the same results.

@@ -345,7 +351,7 @@ These settings combined will provide the proxying effect via the TUN interface c
 \subsection{Local Proxy Routing}
 \label{section:implementation-local-proxy-routing}

-Routing within the local proxy expects $1+N$ interfaces: one connected to the client device expecting the public IP, and $N$ connected to the wider Internet for communication with the other node. Referring to figure \ref{fig:dataflow-overview}, it can be seen that no complex rules are required to achieve this routing, as each interface serves a different role. As such, there are three goals: ensure the packets for the remote IP are routed from the TUN to the client device and vice versa, ensuring that packets destined for the remote proxy are not routed to the client, and ensuring each connection is routed via the correct WAN connection. The first two will be covered in this section, with a discussion on the latter in the next section.
+Routing within the local proxy expects $1+N$ interfaces: one connected to the client device expecting the public IP, and $N$ connected to the wider Internet for communication with the other node. Referring to Figure \ref{fig:dataflow-overview}, it can be seen that no complex rules are required to achieve this routing, as each interface serves a different role. As such, there are three goals: ensuring that packets for the remote IP are routed from the TUN to the client device and vice versa, ensuring that packets destined for the remote proxy are not routed to the client, and ensuring that each connection is routed via the correct WAN connection. The first two will be covered in this section, with a discussion of the third in the next section.

 Routing the packets from/for the local proxy is pleasantly easy. Firstly, enable IP forwarding for Linux or gateway mode for FreeBSD, as seen previously. Secondly, routes must be set up. Fortunately, these routes are far simpler than those for the remote proxy. The routing for the local proxy client interface is as follows on Linux: