%*******************************************************************************
%****************************** Third Chapter **********************************
%*******************************************************************************
\chapter{Implementation}
% **************************** Define Graphics Path **************************
\ifpdf
\graphicspath{{3_Implementation/Figs/Raster/}{3_Implementation/Figs/PDF/}{3_Implementation/Figs/}}
\else
\graphicspath{{3_Implementation/Figs/Vector/}{3_Implementation/Figs/}}
\fi
% --------------------------- Introduction --------------------------------- %
Implementation of the proxy is in two parts: software that provides a multipath layer 3 tunnel between two hosts, and the system configuration necessary to utilise this tunnel as a proxy. An overview of the software and system is presented in Figure \ref{fig:dataflow-overview}.
This chapter details this implementation in three sections. The software will be described in Sections \ref{section:implementation-packet-transport} and \ref{section:implementation-software-structure}. Section \ref{section:implementation-packet-transport} details the implementation of both TCP and UDP methods of transporting the tunnelled packets between the hosts. Section \ref{section:implementation-software-structure} explains the software's structure and dataflow. The system configuration will be described in Section \ref{section:implementation-system-configuration}. Figure \ref{fig:dataflow-overview} shows the path of packets within the proxy, and it will be referenced throughout these sections.
\begin{sidewaysfigure}
\includegraphics[width=\textheight]{overview.png}
\caption{Diagram of packet path from a client behind the proxy to a server on the Internet.}
\label{fig:dataflow-overview}
\end{sidewaysfigure}
% -------------------------------------------------------------------------- %
% -------------------------- Packet Transport ------------------------------ %
% -------------------------------------------------------------------------- %
\section{Packet Transport}
\label{section:implementation-packet-transport}
As shown in Figure \ref{fig:dataflow-overview}, packets are transported between the two hosts by pairs of producers and consumers. A transport pair consists of a consumer on one proxy and a producer on the other: packets enter the consumer and exit the corresponding producer. Two methods of transport are implemented: TCP and UDP. As the greedy load balancing of this proxy relies on congestion control, TCP provided an initial proof-of-concept, while UDP builds on this proof-of-concept to remove unnecessary overhead and improve performance in the case of TCP-over-TCP tunnelling. Section \ref{section:implementation-tcp} discusses the method of transporting discrete packets across the continuous byte stream of a TCP flow, before describing why this solution is not ideal. Section \ref{section:implementation-udp} then discusses adding congestion control to UDP datagrams, while avoiding retransmitting proxied packets.
\subsection{TCP}
\label{section:implementation-tcp}
The requirements for greedy load balancing to function are simple: flow control and congestion control. TCP provides both of these, so was an obvious initial solution. However, TCP also introduces unnecessary overhead, which is discussed further below.
A TCP flow cannot be connected directly to a TUN adaptor, as the TUN adaptor accepts and outputs discrete, formatted IP packets while the TCP connection carries a stream of bytes. To resolve this, each packet sent across a TCP flow is prefixed with its length. When a TCP consumer is given a packet to send, it first sends the 32-bit length of the packet across the TCP flow, before sending the packet itself. The corresponding TCP producer reads these 4 bytes from the TCP flow, then reads that many further bytes, which form the packet. This punctuates the stream-oriented TCP flow into a packet-carrying connection.
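A minimal sketch of this framing is given below, written in Go to match the rest of the project. The function names, the direct use of \texttt{net.Conn}, and the big-endian byte order are illustrative assumptions rather than the exact implementation.
\begin{minted}{go}
package tcp

import (
	"encoding/binary"
	"io"
	"net"
)

// consume writes a 32-bit length prefix followed by the packet itself.
func consume(conn net.Conn, packet []byte) error {
	var length [4]byte
	binary.BigEndian.PutUint32(length[:], uint32(len(packet)))
	if _, err := conn.Write(length[:]); err != nil {
		return err
	}
	_, err := conn.Write(packet)
	return err
}

// produce reads the 4-byte length, then exactly that many bytes, recovering
// one discrete packet from the byte stream.
func produce(conn net.Conn) ([]byte, error) {
	var length [4]byte
	if _, err := io.ReadFull(conn, length[:]); err != nil {
		return nil, err
	}
	packet := make([]byte, binary.BigEndian.Uint32(length[:]))
	_, err := io.ReadFull(conn, packet)
	return packet, err
}
\end{minted}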
However, using TCP to tunnel TCP packets (TCP-over-TCP) can cause a degradation in performance \citep{honda_understanding_2005}. Further, using TCP to tunnel IP packets provides a superset of the required guarantees, in that reliable delivery and ordering are guaranteed. Reliable delivery can cause a decrease in performance for tunnelled flows which may not require reliable delivery, such as a live video stream. Ordering can limit performance when tunnelling multiple streams, as a packet for a phone call could already be received, but instead has to wait in a buffer for a packet for an unrelated download to arrive.
Although the TCP implementation provides an excellent proof-of-concept, work moved to a second UDP implementation, aiming to solve some of these problems. The TCP implementation nevertheless remains functionally correct; in cases where a connection suffering particularly high packet loss is combined with one that is more stable, TCP could be employed on the high-loss connection to limit overall packet loss. The effectiveness of such a solution would be implementation-specific, so is left for the architect to decide.
% --------------------------------- UDP ------------------------------------ %
\subsection{UDP}
\label{section:implementation-udp}
After initial success with the TCP proof-of-concept, work moved to developing a UDP protocol for transporting the proxied packets. Unlike TCP, which provides a stream of bytes, UDP provides a basic mechanism for sending discrete messages. Implementing a UDP datagram solution returns control from the kernel to the application itself, allowing much finer-grained management of congestion control. Further, UDP improves performance over TCP by removing ordering guarantees, and by avoiding the problems of TCP-over-TCP tunnelling. This allows maximum flexibility, as application developers should not have to avoid using TCP to maintain compatibility with my proxy.
This section first describes the special-purpose congestion control mechanism designed, which uses negative acknowledgements to avoid retransmissions. This mechanism informs the design of the UDP packet structure. Finally, this section discusses the initial implementation of congestion control, which is based on the characteristic curve of TCP New Reno \citep{henderson_newreno_2012}.
\subsubsection{Congestion Control}
Congestion control is most commonly applied in the context of reliable delivery. This provides a significant benefit to TCP congestion control protocols: cumulative acknowledgements. As all of the bytes should always arrive eventually, unless the connection has faulted, the acknowledgement number (ACK) can simply be set to the highest contiguously received byte. Therefore, some adaptations are necessary for such a congestion control algorithm to apply in a context where reliable delivery is not expected. Firstly, for a packet-based connection, ACKing specific bytes makes little sense: a packet is atomic, and is lost as a whole unit. To account for this, sequence numbers and their respective acknowledgements are per packet, as opposed to per byte.
Secondly, for a protocol that does not guarantee reliable delivery, cumulative acknowledgements are not as simple. As a tunnelled packet is now permitted to never arrive, even within the correct function of the flow, an ACK that is simply set to the highest received sequence number would cause deadlock whenever a packet is lost, as demonstrated in Figure \ref{fig:sequence-ack-discontinuous}. Neither side can progress once the window is full: the sender will not receive an ACK to free up space within the window, and the receiver will not receive the missing packet to increase the ACK. In TCP, one would expect the missing packet (one above the received ACK) to be retransmitted, allowing the ACK to catch up in only one RTT. However, as retransmissions are to be avoided, the UDP solution presented here would become deadlocked: the sending side knows that the far side has not received the packet, but must not retransmit it.
\begin{figure}
\hfill
\begin{subfigure}[t]{0.3\textwidth}
\centering
\begin{tabular}{|c|c|}
SEQ & ACK \\
1 & 0 \\
2 & 0 \\
3 & 2 \\
4 & 2 \\
5 & 2 \\
6 & 5 \\
6 & 6
\end{tabular}
\caption{ACKs only responding to in-order sequence numbers}
\label{fig:sequence-ack-continuous}
\end{subfigure}\hfill
\begin{subfigure}[t]{0.3\textwidth}
\centering
\begin{tabular}{|c|c|}
SEQ & ACK \\
1 & 0 \\
2 & 0 \\
3 & 2 \\
5 & 3 \\
6 & 3 \\
7 & 3 \\
7 & 3
\end{tabular}
\caption{ACKs stalled by a missing sequence number}
\label{fig:sequence-ack-discontinuous}
\end{subfigure}\hfill
\begin{subfigure}[t]{0.35\textwidth}
\centering
\begin{tabular}{|c|c|c|}
SEQ & ACK & NACK \\
1 & 0 & 0 \\
2 & 0 & 0 \\
3 & 2 & 0 \\
5 & 2 & 0 \\
6 & 2 & 0 \\
7 & 6 & 4 \\
7 & 7 & 4
\end{tabular}
\caption{ACKs and NACKs responding to a missing sequence number}
\label{fig:sequence-ack-nack-discontinuous}
\end{subfigure}
\caption{Congestion control responding to correct and missing sequence numbers of packets.}
\label{fig:sequence-ack-nack-comparison}
\hfill
\end{figure}
I present a solution based on Negative Acknowledgements (NACKs). When the receiver believes that it will never receive a packet, it increases the NACK to the highest missing sequence number, and sets the ACK to one above the NACK. This occurs after a timeout, presently set at $3 \times RTT$ (Round Trip Time). The ACK algorithm is then performed to grow the ACK as high as possible. Any change in the NACK therefore represents at least one lost packet, which the specific congestion control algorithm can use to react. Though this usage of the NACK appears to provide a close approximation to ACKs under reliable delivery, the choice of how to use the ACK and NACK fields is delegated to the congestion controller implementation, allowing for different interpretations if they better suit the method of congestion control. Using NACKs, the deadlock in Figure \ref{fig:sequence-ack-discontinuous} can be avoided, with the case in Figure \ref{fig:sequence-ack-nack-discontinuous} occurring instead. The NACK informs the far side that a packet was lost, allowing it to continue sending fresh packets. In contrast, TCP would retransmit the missing packet, which this NACK-based solution avoids.
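To illustrate, a receiver-side sketch of this bookkeeping is given below in Go. The structure, names and map-based tracking are assumptions made for clarity, not the exact implementation; only the $3 \times RTT$ abandonment rule is taken from the description above.
\begin{minted}{go}
package congestion

import "time"

// receiver tracks which sequence numbers have arrived and decides when to
// abandon a missing packet (illustrative sketch; maps are assumed to be
// initialised by a constructor).
type receiver struct {
	ack, nack    uint32               // values to echo in outgoing headers
	received     map[uint32]bool      // out-of-order packets above the ACK
	missingSince map[uint32]time.Time // when each gap was first observed
	rtt          time.Duration
}

func (r *receiver) onPacket(seq uint32, now time.Time) {
	r.received[seq] = true
	for {
		next := r.ack + 1
		if r.received[next] {
			// Contiguous: grow the ACK as high as possible.
			delete(r.received, next)
			delete(r.missingSince, next)
			r.ack = next
			continue
		}
		if len(r.received) == 0 {
			return // nothing above the ACK has arrived, so nothing is missing
		}
		// next is missing while later packets have already arrived.
		first, seen := r.missingSince[next]
		switch {
		case !seen:
			r.missingSince[next] = now // start its timeout
			return
		case now.Sub(first) <= 3*r.rtt:
			return // still waiting for it
		default:
			// Give up: NACK the packet and move the ACK past it.
			r.nack = next
			delete(r.missingSince, next)
			r.ack = next
		}
	}
}
\end{minted}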
Given the decision to use ACKs and NACKs, the packet structure for UDP datagrams can now be designed. The chosen structure is given in Figure \ref{fig:udp-packet-structure}. The congestion control header consists of the sequence number, the ACK and the NACK, each a 32-bit unsigned integer.
\begin{figure}
\centering
\begin{bytefield}[bitwidth=0.6em]{32}
\bitheader{0-31} \\
\begin{rightwordgroup}{UDP\\Header}
\bitbox{16}{Source port} & \bitbox{16}{Destination port} \\
\bitbox{16}{Length} & \bitbox{16}{Checksum}
\end{rightwordgroup} \\
\begin{rightwordgroup}{CC\\Header}
\bitbox{32}{Acknowledgement number} \\
\bitbox{32}{Negative acknowledgement number} \\
\bitbox{32}{Sequence number}
\end{rightwordgroup} \\
\wordbox[tlr]{1}{Proxied IP packet} \\
\skippedwords \\
\wordbox[blr]{1}{} \\
\begin{rightwordgroup}{Security\\Footer}
\wordbox[tlr]{1}{Security footer} \\
\wordbox[blr]{1}{$\cdots$}
\end{rightwordgroup}
\end{bytefield}
\caption{UDP packet structure}
\label{fig:udp-packet-structure}
\end{figure}
\subsubsection{New Reno}
TCP New Reno \citep{henderson_newreno_2012} is widely known for its sawtooth pattern of throughput. New Reno is an RTT-based congestion control mechanism which, in the steady state, increases the window size by one for each window transmitted successfully, and halves it when a retransmission occurs. The window size is the number of packets that can be in flight at one time; a longer round trip time therefore requires a larger window size to transmit the same number of packets in a given time. A freshly started New Reno connection begins in slow start, which instead increases the window size by one for each packet transmitted successfully, as opposed to each full window of packets. This creates exponential growth, which stops on the first transmission failure.
An algorithm that takes advantage of NACKs behaves identically when no packets are lost: the window size increases by one for each packet acknowledged during slow start, then by one for each full window in the steady state. The difference from TCP arises when packets are lost, and specifically in how loss is detected. Here the NACK mechanism is used, setting the NACK to the missing sequence number if a packet has been waiting for more than $0.5 \times RTT$ to be acknowledged. For example, if packet 4 arrives before packet 3, and packet 3 has still not arrived after an additional half round trip time (the total time within which it would be expected), the NACK field on the next packet is set to 3 and the ACK field to 4. When the sender receives this NACK, it adjusts the window size as TCP would on detecting loss: halving it and leaving slow start.
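The sender-side reaction can be sketched roughly as follows, again in Go. The \texttt{onAck} and \texttt{onNack} hooks and the field names here are hypothetical; the sketch only illustrates the shape of the slow start, additive increase and multiplicative decrease described above.
\begin{minted}{go}
package congestion

// newReno is an illustrative sketch of a New Reno style controller driven by
// ACKs and NACKs rather than retransmissions.
type newReno struct {
	windowSize   uint32 // packets allowed in flight at once
	slowStart    bool
	ackedInRound uint32 // ACKs counted towards the current window round
}

// onAck is called once for each newly acknowledged packet.
func (c *newReno) onAck() {
	if c.slowStart {
		c.windowSize++ // exponential growth: +1 per packet
		return
	}
	c.ackedInRound++
	if c.ackedInRound >= c.windowSize {
		c.windowSize++ // additive increase: +1 per full window
		c.ackedInRound = 0
	}
}

// onNack is called when the NACK advances, indicating at least one lost packet.
func (c *newReno) onNack() {
	c.slowStart = false // leave slow start permanently
	if c.windowSize > 1 {
		c.windowSize /= 2 // multiplicative decrease
	}
	c.ackedInRound = 0
}
\end{minted}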
The congestion control algorithm has multiple threads accessing it at any one time, so it uses a mixture of atomic operations and fine-grained locking to remain consistent. The \texttt{ack}, \texttt{nack} and \texttt{windowSize} fields all use atomic operations, such that they can be read immediately, allowing a packet to be sent almost without taking a lock. However, the \texttt{inFlight} and \texttt{awaitingAck} fields are each protected by a mutex, ensuring that they remain consistent. This is a compromise between performance and complexity, limiting code complexity while allowing more performance than coarse-grained locking. Further, high-level data structures (specifically, growable lists) are used, which reduce programming complexity at the cost of some performance. This improves readability, and increases the likelihood of the code being correct.
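The pattern might look roughly as follows; the field names \texttt{ack}, \texttt{nack}, \texttt{windowSize} and \texttt{inFlight} are those mentioned above, while the surrounding structure and method are assumed for illustration.
\begin{minted}{go}
package congestion

import (
	"sync"
	"sync/atomic"
)

// controller mixes atomics for the hot-path fields with a mutex for the
// in-flight bookkeeping (illustrative sketch only).
type controller struct {
	ack        uint32 // read and written atomically
	nack       uint32
	windowSize uint32

	mu       sync.Mutex
	inFlight []uint32 // sequence numbers sent but not yet acknowledged
}

// onAck records a new acknowledgement: the ack field is stored atomically so
// other goroutines can read it without locking, while the in-flight list is
// pruned under the mutex.
func (c *controller) onAck(ack uint32) {
	atomic.StoreUint32(&c.ack, ack)

	c.mu.Lock()
	kept := c.inFlight[:0]
	for _, seq := range c.inFlight {
		if seq > ack { // still awaiting acknowledgement
			kept = append(kept, seq)
		}
	}
	c.inFlight = kept
	c.mu.Unlock()
}
\end{minted}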
Congestion control is one of the main targets for tests in the repository. The New Reno controller was developed largely with test-driven development, due to the complicated interactions between threads. Though testing multithreaded code can be extremely challenging, given the risk of deadlock when the code is incorrect, generous timeouts and a CI environment made this quite manageable.
% -------------------------------------------------------------------------- %
% ------------------------- Software Structure ----------------------------- %
% -------------------------------------------------------------------------- %
\section{Software Structure}
\label{section:implementation-software-structure}
This section details the design decisions behind the application's structure, and how it fits into the systems where it will be used. Much of the focus is on keeping the interfaces flexible for future additions, while also describing the concrete implementations available with the software as of this work.
% ---------------------- Running the Application --------------------------- %
\subsection{Running the Application}
Initially, the application suffered from a significant race condition when starting. The application followed a standard flow, where it created a TUN adaptor to receive IP packets and then began proxying packets to and from it. However, no notification was given when this TUN adaptor became available, so any configuration applied to the TUN adaptor raced with its creation, resulting in frequent startup failures.
The software now runs in much the same way as other network daemons, giving a familiar experience to administrators. The primary inspiration for the behaviour of the application is WireGuard \citep{donenfeld_wireguard_2017}, specifically \verb'wireguard-go'\footnote{\url{https://github.com/WireGuard/wireguard-go}}. To launch the application, the following shell command is used:
\begin{minted}{shell-session}
netcombiner nc0
\end{minted}
When the program is executed as such, the following control flow occurs:
\begin{minted}{python}
if not child_process:
    c = validate_configuration()
    t = new_tun(nc0)
    child_process = new_process(this, c, t)
    return
proxy = new_proxy(c, t)
proxy.run()
\end{minted}
Firstly, the application validates the configuration, allowing an early exit if misconfigured. Then the TUN adaptor is created. This TUN adaptor and the configuration are handed to a duplicate of the process, which begins running the proxy. This allows the parent process to exit, while the background process continues running as a daemon.
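On a Unix-like system, this hand-off could be sketched as below in Go. The \texttt{-child} flag, the device name and the use of file descriptor 3 are illustrative assumptions, not the exact mechanism used.
\begin{minted}{go}
package main

import (
	"log"
	"os"
	"os/exec"
)

// launchDaemon re-executes the current binary in the background, handing it
// the already created TUN device so the parent can exit once the adaptor is
// ready. The "-child" flag and the use of fd 3 are illustrative assumptions.
func launchDaemon(tunFile *os.File) {
	exe, err := os.Executable()
	if err != nil {
		log.Fatal(err)
	}
	cmd := exec.Command(exe, "-child", "nc0")
	cmd.ExtraFiles = []*os.File{tunFile} // inherited as fd 3 in the child
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	// The parent returns (and exits), signalling that the TUN adaptor is up;
	// the child recovers the device with os.NewFile(3, "nc0") and runs the proxy.
}
\end{minted}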
By exiting cleanly and running the proxy in the background, the race condition is avoided. The exit notifies the launcher that the TUN adaptor is up and ready, allowing further configuration steps to occur. Otherwise, an implementation-specific signal would be necessary to allow the launcher of the application to move on, which conflicts with the requirement of easy future platform compatibility.
% ------------------------------ Security ---------------------------------- %
\subsection{Security}
The integrated security solution of this software is in two parts: message authentication and repeat protection. The interface for these is shared, as they perform the same action from the perspective of the producer or consumer.
\subsubsection{Message Authenticity Verification}
Message authentication is provided by a pair of interfaces, \texttt{MacGenerator} and \texttt{MacVerifier}, which add bytes at consumers and remove bytes at producers respectively. \texttt{MacGenerator} provides a method which takes input data and produces a list of bytes as output, to be appended to the message. \texttt{MacVerifier} takes the bytes appended to the message, and confirms whether they are valid for that message.
The provided implementation of message authenticity uses the BLAKE2s \citep{hutchison_blake2_2013} algorithm. By using library functions, the implementation reduces to bridging the interface provided by the library with the interface described here. This keeps the code clear, and reduces the likelihood of introducing a bug.
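A sketch of such a bridge, using the keyed mode of the \texttt{golang.org/x/crypto/blake2s} library, might look as follows; the method names and signatures are assumptions for illustration rather than the exact interfaces.
\begin{minted}{go}
package sharedkey

import (
	"crypto/hmac"

	"golang.org/x/crypto/blake2s"
)

// sharedKeyMac generates and verifies MACs using BLAKE2s in keyed mode.
// The interface shape (Generate/Verify over byte slices) is an assumption.
type sharedKeyMac struct {
	key []byte // shared key, at most 32 bytes for BLAKE2s
}

// Generate returns the MAC bytes to be appended to data.
func (m *sharedKeyMac) Generate(data []byte) ([]byte, error) {
	h, err := blake2s.New256(m.key)
	if err != nil {
		return nil, err
	}
	h.Write(data)
	return h.Sum(nil), nil
}

// Verify checks that mac is the correct MAC for data.
func (m *sharedKeyMac) Verify(data, mac []byte) (bool, error) {
	expected, err := m.Generate(data)
	if err != nil {
		return false, err
	}
	return hmac.Equal(expected, mac), nil // constant-time comparison
}
\end{minted}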
Key exchange is presently performed over a secure external channel. For example, one might configure the proxies over the Secure Shell Protocol (SSH), and transmit the shared key across that secure channel. In future, this could be extended with external software that manages the tunnel, using its own secure channel to configure the proxies with a shared key.
\subsubsection{Repeat Protection}
Repeat protection takes advantage of the same two interfaces already mentioned. To allow this, each consumer or producer takes an ordered list of \verb'MacGenerator's or \verb'MacVerifier's. When a packet is consumed, each generator is run in order, operating on the output of the previous. When a packet is produced, this operation is performed in reverse, with each \verb'MacVerifier' stripping off the bytes added by the corresponding generator. An example of this is shown in Figure \ref{fig:udp-packet-dataflow}. Firstly, the data sequence number is added, then the MAC. When receiving the packet, the MAC is stripped first, then the data sequence number. This means that the data sequence number is protected by the MAC.
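The layering might be sketched as below, reusing the hypothetical \texttt{Generate} and \texttt{Verify} shapes from the previous example; the layer-size bookkeeping is likewise an assumption for illustration.
\begin{minted}{go}
package proxy

import "fmt"

// Hypothetical interface shapes, mirroring the earlier sketch.
type MacGenerator interface{ Generate(data []byte) ([]byte, error) }
type MacVerifier interface{ Verify(data, mac []byte) (bool, error) }

// applyGenerators appends each layer's bytes in order, so later (outer)
// layers protect the bytes added by earlier (inner) ones.
func applyGenerators(packet []byte, gens []MacGenerator) ([]byte, error) {
	for _, g := range gens {
		mac, err := g.Generate(packet)
		if err != nil {
			return nil, err
		}
		packet = append(packet, mac...)
	}
	return packet, nil
}

// stripVerifiers checks and removes each layer in reverse order; sizes holds
// the number of bytes each generator appended.
func stripVerifiers(packet []byte, vers []MacVerifier, sizes []int) ([]byte, error) {
	for i := len(vers) - 1; i >= 0; i-- {
		split := len(packet) - sizes[i]
		data, mac := packet[:split], packet[split:]
		ok, err := vers[i].Verify(data, mac)
		if err != nil {
			return nil, err
		}
		if !ok {
			return nil, fmt.Errorf("MAC layer %d failed verification", i)
		}
		packet = data
	}
	return packet, nil
}
\end{minted}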
One difference between repeat protection and MAC generation is that repeat protection is shared between all producers and consumers. This is in contrast to message authenticity, which is, as implemented, specific to a producer or consumer. The currently implemented repeat protection is that of \cite{tsou_ipsec_2012}. The code sample is provided with a BSD license, so is compatible with this project, and hence was simply adapted from C to Go. The repeat protection is created at a host level when building the proxy, and the same instance is shared amongst all producers and consumers, so it has to be thread safe. Producing the sequence numbers is achieved with a single atomic operation, avoiding the need to lock at all. Verifying the sequence numbers requires altering multiple elements of an array of bytes, so uses locking to ensure consistency. Ensuring that locks are only taken when necessary keeps these calls as efficient as possible.
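As a rough illustration of the shape of such a filter, a greatly simplified single-word sliding window is sketched below; the adapted RFC 6479 code uses a larger, multi-word bitmap, but the atomic generation and mutex-guarded verification follow the same pattern. The type and method names are hypothetical.
\begin{minted}{go}
package replay

import (
	"sync"
	"sync/atomic"
)

// Sequence numbers are produced with a single atomic operation, needing no lock.
type generator struct{ next uint32 }

func (g *generator) Next() uint32 {
	return atomic.AddUint32(&g.next, 1)
}

// filter is a simplified single-word sliding window over recent sequence numbers.
type filter struct {
	mu      sync.Mutex
	highest uint32
	window  uint64 // bit i set => sequence number (highest - i) already seen
}

// Check reports whether seq is fresh, updating the window under the mutex.
func (f *filter) Check(seq uint32) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	switch {
	case seq > f.highest:
		shift := seq - f.highest
		if shift < 64 {
			f.window = f.window<<shift | 1
		} else {
			f.window = 1
		}
		f.highest = seq
		return true
	case f.highest-seq >= 64:
		return false // too old to track: reject
	default:
		bit := uint64(1) << (f.highest - seq)
		if f.window&bit != 0 {
			return false // replayed
		}
		f.window |= bit
		return true
	}
}
\end{minted}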
\begin{figure}
\centering
\begin{tikzpicture}[
onenode/.style={rectangle, draw=black!60, fill=red!5, very thick, minimum size=5mm, align=center},
twonode/.style={rectangle, draw=black!60, fill=red!5, very thick, minimum size=5mm, align=center, rectangle split, rectangle split parts=2},
threenode/.style={rectangle, draw=black!60, fill=red!5, very thick, minimum size=5mm, align=center, rectangle split, rectangle split parts=3},
fournode/.style={rectangle, draw=black!60, fill=red!5, very thick, minimum size=5mm, align=center, rectangle split, rectangle split parts=4},
fivenode/.style={rectangle, draw=black!60, fill=red!5, very thick, minimum size=5mm, align=center, rectangle split, rectangle split parts=5},
bluenode/.style={rectangle, draw=black!60, fill=blue!5, very thick, minimum size=5mm, align=center},
]
% Nodes
\node[fivenode] at (0,0) (udp) {\nodepart{one} UDP Header \nodepart{two} Congestion\\Control\\Header \nodepart{three} Packet\\Data \nodepart{four} Data\\Sequence\\Number \nodepart{five} MAC};
\node[fournode] at (3,0) (mac) {\nodepart{one} Congestion\\Control\\Header \nodepart{two} Packet\\Data \nodepart{three} Data\\Sequence\\Number \nodepart{four} MAC};
\node[threenode] at (6,0) (cc) {\nodepart{one} Congestion\\Control\\Header \nodepart{two} Packet\\Data \nodepart{three} Data\\Sequence\\Number};
\node[twonode] at (9,0) (sequence) {\nodepart{one} Congestion\\Control\\Header \nodepart{two} Packet\\Data};
\node[onenode] at (12,0) (data) {Packet\\Data};
% Edges
\draw[<->] (udp.east) -- (mac.west);
\draw[<->] (mac.east) -- (cc.west);
\draw[<->] (cc.east) -- (sequence.west);
\draw[<->] (sequence.east) -- (data.west);
\end{tikzpicture}
\caption{Expansion of a UDP packet through a consumer/producer.}
\label{fig:udp-packet-dataflow}
\end{figure}
% ------------------------ Repository Overview ----------------------------- %
\subsection{Repository Overview}
A directory tree of the repository is provided in Figure \ref{fig:repository-structure}. The top level is split between \verb'code' and \verb'evaluation', where \verb'code' is compiled to produce the application binary, and \verb'evaluation' is used to verify the performance characteristics and generate graphs. The Go code is built with the Go modules system, the Java code is built with Gradle, and the Python code runs in an IPython notebook. Go tests are interspersed with the code; for example, a file named \texttt{flow\_test.go} provides tests for \texttt{flow.go} in the same directory.
\begin{figure}
\dirtree{%
.1 /.
.2 code\DTcomment{Go code for the project}.
.3 config\DTcomment{Configuration management}.
.3 crypto\DTcomment{Cryptographic methods}.
.4 sharedkey\DTcomment{Shared key MACs}.
.3 flags\DTcomment{Command line flag processing}.
.3 mocks\DTcomment{Mocks to enable testing}.
.3 proxy\DTcomment{The central proxy controller}.
.3 replay\DTcomment{Replay protection}.
.3 shared\DTcomment{Shared errors}.
.3 tcp\DTcomment{TCP flow transport}.
.3 tun\DTcomment{TUN adaptor}.
.3 udp\DTcomment{UDP datagram transport}.
.4 congestion\DTcomment{Congestion control methods}.
.3 .drone.yml\DTcomment{CI specification}.
.2 evaluation\DTcomment{Result gathering and graph generation}.
.3 java\DTcomment{Java automated result gathering}.
.3 python\DTcomment{Python graph generation}.
}
\caption{Repository folder structure.}
\label{fig:repository-structure}
\end{figure}
% -------------------------------------------------------------------------- %
% ------------------------ System Configuration ---------------------------- %
% -------------------------------------------------------------------------- %
\section{System Configuration}
\label{section:implementation-system-configuration}
The software portion of this proxy is entirely symmetric, as can be seen in Figure \ref{fig:dataflow-overview}. However, the system configuration differs, as each side of the proxy serves a different role. Referring to Figure \ref{fig:dataflow-overview}, it can be seen that the kernel routing differs between the two nodes. Throughout, these two sides have been referred to as the local proxy and the remote proxy, shown in the top left and bottom right of the figure respectively.
As the software portion of this application is implemented in user-space, it has no control over the routing of packets. Instead, a virtual interface is provided, and the kernel is instructed to route relevant packets to/from this interface. Sections \ref{section:implementation-remote-proxy-routing} and \ref{section:implementation-local-proxy-routing} explain the configuration for routing packets at the remote proxy and the local proxy respectively. Finally, Section \ref{section:implementation-multi-interface-routing} discusses some potentially unexpected behaviour of devices with multiple interfaces, such that the reader can avoid these pitfalls. Throughout this section, examples are given for both Linux and FreeBSD; they are only one of many methods of achieving the same results.
\subsection{Remote Proxy Routing}
\label{section:implementation-remote-proxy-routing}
The common case for remote proxies is a cloud Virtual Private Server (VPS) with one public network interface. As such, some configuration is required to both proxy bidirectionally via that interface, and also use it for communication with the local proxy. Firstly, packet forwarding must be enabled for the device. On Linux this is achieved as follows:
\begin{minted}{shell-session}
sysctl -w net.ipv4.ip_forward=1
\end{minted}
\noindent
Or on FreeBSD via:
\begin{minted}{shell-session}
echo 'GATEWAY_ENABLE="YES"' >> /etc/rc.conf
\end{minted}
These instruct the kernel in each case to forward packets. However, further instructions are necessary to ensure packets are routed correctly once forwarded. For the remote proxy, this involves two things: routing the proxy's own communication to the software side, and routing packets destined for the local system to the relevant application. Both are achieved in the same way, by adjusting the local routing table on Linux, and by using \verb'pf(4)' rules on FreeBSD.
\vspace{3mm} \noindent
Linux:
\inputminted{shell-session}{3_Implementation/Samples/shell/linux_remote_routing.sh}
\noindent
FreeBSD:
\inputminted{shell-session}{3_Implementation/Samples/shell/freebsd_remote_routing.sh}
These settings combined provide the proxying effect via the TUN interface configured in software. It is also likely worth firewalling much more aggressively at the remote proxy side, as dropping packets before they saturate the low-bandwidth connections between the local and remote proxy improves resilience to denial-of-service attacks. This can be achieved either with similar routing and firewall rules to those above, or externally with many cloud providers.
\subsection{Local Proxy Routing}
\label{section:implementation-local-proxy-routing}
Routing within the local proxy expects $1+N$ interfaces: one connected to the client device expecting the public IP, and $N$ connected to the wider Internet for communication with the other node. Referring to Figure \ref{fig:dataflow-overview}, it can be seen that no complex rules are required to achieve this routing, as each interface serves a different role. As such, there are three goals: ensuring that packets for the remote IP are routed from the TUN to the client device and vice versa, ensuring that packets destined for the remote proxy are not routed to the client, and ensuring that each connection is routed via the correct WAN connection. The first two are covered in this section, with the third discussed in the next.
Routing the packets to and from the local proxy is pleasantly simple. Firstly, enable IP forwarding on Linux or gateway mode on FreeBSD, as seen previously. Secondly, routes must be set up. Fortunately, these routes are far simpler than those for the remote proxy. The routing for the local proxy client interface is as follows on Linux:
\inputminted{shell-session}{3_Implementation/Samples/shell/linux_local_interface.sh}
\noindent
Or on FreeBSD:
\inputminted{shell-session}{3_Implementation/Samples/shell/freebsd_local_interface.sh}
Then, on the client device, simply set the IP address statically to the remote proxy address, and the gateway to \verb'192.168.1.1'. Now the local proxy can send and receive packets to the remote proxy, but some further routing rules are needed to ensure that the packets from the proxy reach the remote proxy, and that forwarding works correctly. This falls to routing tables and \verb'pf(4)', so for Linux:
\inputminted{shell-session}{3_Implementation/Samples/shell/linux_local_routing.sh}
\noindent
FreeBSD:
\inputminted{shell-session}{3_Implementation/Samples/shell/freebsd_local_routing.sh}
These rules achieve both of the listed criteria: communicating with the remote proxy while also forwarding the necessary packets to the client. The local proxy can be extended with more functionality, such as NAT and DHCP. This allows plug-and-play operation for the client, while also allowing multiple clients to take advantage of the connection without another router present.
\subsection{Multi-Homed Behaviour}
\label{section:implementation-multi-interface-routing}
During testing, I discovered some surprising behaviour on multi-homed hosts. This section details that behaviour, and the workarounds found to keep the software working well regardless.
The first piece of surprising behaviour arises from a device with multiple interfaces on the same subnet. Consider a device with two Ethernet interfaces, each of which gains a DHCP IPv4 address from the same network. The first interface \verb'eth0' takes the IP \verb'10.10.0.2' and the second \verb'eth1' takes the IP \verb'10.10.0.3', each with a subnet mask of \verb'/24'. If a packet originates from userspace with source address \verb'10.10.0.2' and destination address \verb'10.10.0.1', it may leave via either \verb'eth0' or \verb'eth1'. I initially found this very surprising, as it seems clear that the packet should leave from \verb'eth0', the interface which holds the given IP. However, as routing is performed by source subnet, both interfaces match.
Although this may seem like a contrived use case, consider the following: a dual-WAN router lies in front of a server, which uses these two interfaces to take two IPs. Policy routing is used on the dual-WAN router to give this device control over the choice of WAN, by using either of its LAN IPs. In this case, the default routing would mean that the userspace software has no control over the WAN, as one will be selected seemingly arbitrarily. The solution to this problem is manipulation of routing tables. By creating a high-priority routing table for each interface, and routing packets more specifically than the default routes, the correct packets can be routed outbound via the correct interface.
The second issue follows a similar theme of IP addresses being owned by the host rather than the interface on which they are set: by default, Linux hosts respond to ARP requests for any of their IP addresses on all interfaces. This problem is known as ARP flux. Returning to the prior example of \verb'eth0' and \verb'eth1' on the same subnet, ARP flux means that if another host sends packets to \verb'10.10.0.2', they may arrive at either \verb'eth0' or \verb'eth1', and this can change over time. Once again, this is rather contrived, but it also means that, for example, a private VPN IP will be responded to on the LAN a computer is connected to. Although this is desirable in some cases, it remains surprising default behaviour. The solution is also simple: a pair of kernel parameters, set by the following, resolve the issue.
\begin{minted}{shell-session}
sysctl -w net.ipv4.conf.all.arp_announce=1
sysctl -w net.ipv4.conf.all.arp_ignore=1
\end{minted}