Implementation of the proxy is in two parts: software that provides a multipath layer 3 tunnel between two hosts, and the system configuration necessary to utilise this tunnel as a proxy. An overview of the software and system is presented in figure \ref{fig:dataflow-overview}.
This chapter details the implementation in three parts. The software is described in sections \ref{section:implementation-packet-transport} and \ref{section:implementation-software-structure}: section \ref{section:implementation-packet-transport} details the TCP and UDP methods of transporting tunnelled packets between the hosts, and section \ref{section:implementation-software-structure} explains the software's structure and dataflow. The system configuration is described in section \ref{section:implementation-system-configuration}, along with a discussion of some of the oddities of multipath routing, such that a reader would have enough knowledge to set up the proxy given the software. Figure \ref{fig:dataflow-overview} shows the path of packets within the proxy, and each section notes where the element it discusses fits within this diagram.
As shown in figure \ref{fig:dataflow-overview}, packets are transported between the two hosts through producers and consumers. A transport pair consists of a consumer on one proxy and a producer on the other: packets enter the consumer and exit the corresponding producer. Two kinds of producer and consumer are implemented: TCP and UDP. As the greedy load balancing of this proxy relies on congestion control, TCP provided a base for a proof of concept, while UDP expands on this proof of concept to remove unnecessary overhead and improve performance in the case of TCP-over-TCP tunnelling. Section \ref{section:implementation-tcp} discusses the method of transporting discrete packets across the continuous byte stream of a TCP flow, before describing why this solution is not ideal. Section \ref{section:implementation-udp} then discusses adding congestion control to UDP datagrams, while avoiding ever retransmitting a proxied packet.
The requirements for greedy load balancing to function are simple: flow control and congestion control. TCP provides both of these, so was an obvious initial solution. However, TCP also brings unnecessary overhead, which is discussed later in this section.
A TCP flow cannot be connected directly to a TUN adapter, as the TUN adapter accepts and outputs discrete and formatted IP packets while the TCP connection sends a stream of bytes. To resolve this, each packet sent across a TCP flow is prefixed with the length of the packet. When a TCP consumer is given a packet to send, it first sends the 32-bit length of the packet across the TCP flow, before sending the packet itself. The corresponding TCP producer then reads these 4 bytes from the TCP flow, before reading the number of bytes specified by the received number. This enables punctuation of the stream-oriented TCP flow into a packet-carrying connection.
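The length-prefix framing described above can be sketched in Go as follows. This is a minimal illustration rather than the project's code: the function names are hypothetical, big-endian byte order for the 32-bit length is an assumption, and a \verb'bytes.Buffer' stands in for the TCP connection.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writePacket sends one packet over a byte stream as a 4-byte length
// prefix followed by the packet. Big-endian is an assumption of this sketch.
func writePacket(w io.Writer, packet []byte) error {
	var prefix [4]byte
	binary.BigEndian.PutUint32(prefix[:], uint32(len(packet)))
	if _, err := w.Write(prefix[:]); err != nil {
		return err
	}
	_, err := w.Write(packet)
	return err
}

// readPacket reads the 4-byte length, then exactly that many bytes.
func readPacket(r io.Reader) ([]byte, error) {
	var prefix [4]byte
	if _, err := io.ReadFull(r, prefix[:]); err != nil {
		return nil, err
	}
	packet := make([]byte, binary.BigEndian.Uint32(prefix[:]))
	if _, err := io.ReadFull(r, packet); err != nil {
		return nil, err
	}
	return packet, nil
}

// roundTrip writes every packet to an in-memory stream (standing in for
// the TCP flow) and reads them back, showing that boundaries survive.
func roundTrip(packets ...[]byte) ([][]byte, error) {
	var stream bytes.Buffer
	for _, p := range packets {
		if err := writePacket(&stream, p); err != nil {
			return nil, err
		}
	}
	var out [][]byte
	for range packets {
		p, err := readPacket(&stream)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}

func main() {
	out, err := roundTrip([]byte("first"), []byte("second packet"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range out {
		fmt.Printf("%d bytes: %s\n", len(p), p)
	}
}
```

A production reader would likely also reject lengths above the tunnel MTU before allocating, so that a corrupted prefix cannot request a very large buffer.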
However, using TCP to tunnel TCP packets (known as TCP-over-TCP) can cause a degradation in performance in non-ideal circumstances \citep{honda_understanding_2005}. Further, using TCP to tunnel IP packets provides a superset of the required guarantees, in that reliable delivery and ordering are guaranteed. Reliable delivery can cause a decrease in performance for tunnelled flows which do not require reliable delivery, such as a live video stream: a live stream gains nothing from waiting for a packet to be redelivered from a portion that has already been played, and thus will spend longer buffering than if it received the up to date packets instead. Ordering can limit performance when tunnelling multiple streams, as a packet for a phone call could already have been received, but has to wait in a buffer for a packet for a download to arrive, increasing latency unnecessarily.
Although the TCP implementation provides an excellent proof of concept and basic implementation, work moved to a second UDP implementation, aiming to solve some of these problems. However, the TCP implementation is functionally correct, so is left as an option, furthering the idea of flexibility maintained throughout this project. In cases where a connection that suffers particularly high packet loss is combined with one which is more stable, TCP could be employed on the high loss connection to limit overall packet loss. The effectiveness of such a solution would be implementation specific, so is left for the architect to decide.
After initial success with the TCP proof-of-concept, work moved to developing a UDP protocol for transporting the proxied packets. UDP differs from TCP in providing a more basic mechanism for sending discrete messages, while TCP provides a stream of bytes. Implementing a UDP datagram proxy solution returns control from the kernel to the application itself, allowing much more fine-grained management of congestion control. Further, UDP improves performance over TCP by removing ordering guarantees, and improves the quality of tunnelled TCP flows compared to TCP-over-TCP. This allows maximum flexibility, as application developers should not have to avoid using TCP to maintain compatibility with my proxy.
This section first describes the special purpose congestion control mechanism designed, which uses negative acknowledgements to avoid retransmissions. This design informs the design of the UDP packet structure. Finally, this section discusses the initial implementation of congestion control, which is based on the characteristic curve of TCP New Reno.
Congestion control is most commonly applied in the context of reliable delivery. This provides a significant benefit to TCP congestion control protocols: cumulative acknowledgements. As all of the bytes should always arrive eventually, unless the connection has faulted, the acknowledgement number (ACK) can simply be set to the highest byte received in order. Therefore, some adaptations are necessary for TCP congestion control algorithms to apply in an unreliable context. Firstly, for a packet based connection, ACKing specific bytes makes little sense, as a packet is atomic and is lost as a whole unit. To account for this, sequence numbers and their respective acknowledgements are for entire packets, as opposed to per byte. Secondly, for an unreliable protocol, cumulative acknowledgements are not as simple. As a packet may now never arrive while the flow is still functioning correctly, an ACK that is simply set to the highest sequence number received in order would deadlock whenever a packet is lost, as demonstrated in figure \ref{fig:sequence-ack-discontinuous}. Neither side can progress once the window is full, as the sender will not receive an ACK to free up space within the window, and the receiver will not receive the missing packet to increase the ACK.
\begin{figure}
\hfill
\begin{subfigure}[t]{0.3\textwidth}
\centering
\begin{tabular}{|c|c|}
SEQ & ACK \\
1 & 0 \\
2 & 0 \\
3 & 2 \\
4 & 2 \\
5 & 2 \\
6 & 5 \\
6 & 6
\end{tabular}
\caption{ACKs only responding to in order sequence numbers}
\label{fig:sequence-ack-continuous}
\end{subfigure}\hfill
\begin{subfigure}[t]{0.3\textwidth}
\centering
\begin{tabular}{|c|c|}
SEQ & ACK \\
1 & 0 \\
2 & 0 \\
3 & 2 \\
5 & 3 \\
6 & 3 \\
7 & 3 \\
7 & 3
\end{tabular}
\caption{ACKs only responding to a missing sequence number}
\label{fig:sequence-ack-discontinuous}
\end{subfigure}\hfill
\begin{subfigure}[t]{0.35\textwidth}
\centering
\begin{tabular}{|c|c|c|}
SEQ & ACK & NACK \\
1 & 0 & 0 \\
2 & 0 & 0 \\
3 & 2 & 0 \\
5 & 2 & 0 \\
6 & 2 & 0 \\
7 & 6 & 4 \\
7 & 7 & 4
\end{tabular}
\caption{ACKs and NACKs responding to a missing sequence number}
\label{fig:sequence-ack-nack-discontinuous}
\end{subfigure}
\caption{Congestion control responding to correct and missing sequence numbers of packets.}
\label{fig:sequence-ack-nack-comparison}
\hfill
\end{figure}
I present a solution based on Negative Acknowledgements (NACKs). When the receiver believes that it will never receive a packet, it increases the NACK to the highest missing sequence number, and sets the ACK to one above the NACK. The ACK algorithm is then performed to grow the ACK as high as possible. This is simplified to any change in NACK representing at least one lost packet, which can be used by the specific congestion control algorithms to react. Though this usage of the NACK appears to provide a close approximation to ACKs on reliable delivery, the choice of how to use the ACK and NACK fields is delegated to the congestion controller implementation, allowing for different implementations if they better suit the method of congestion control.
Given the decision to use ACKs and NACKs, the packet structure for UDP datagrams can now be designed. The chosen structure is given in figure \ref{fig:udp-packet-structure}. The congestion control header consists of the sequence number and the ACK and NACK, each 32-bit unsigned integers.
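As a sketch of how such a header might be serialised, the following Go code packs the three 32-bit fields in front of the proxied packet. The field order (sequence, ACK, NACK), the big-endian byte order, and all names are assumptions made for illustration; the authoritative layout is the one in figure \ref{fig:udp-packet-structure}.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// headerLen is three 32-bit unsigned integers.
const headerLen = 12

// header is a hypothetical in-memory form of the congestion control header.
type header struct {
	Seq, Ack, Nack uint32
}

// marshal prepends the header to the proxied packet.
// Field order and endianness are assumptions of this sketch.
func (h header) marshal(payload []byte) []byte {
	out := make([]byte, headerLen+len(payload))
	binary.BigEndian.PutUint32(out[0:4], h.Seq)
	binary.BigEndian.PutUint32(out[4:8], h.Ack)
	binary.BigEndian.PutUint32(out[8:12], h.Nack)
	copy(out[headerLen:], payload)
	return out
}

// unmarshal splits a datagram into its header and the proxied packet.
func unmarshal(datagram []byte) (header, []byte, error) {
	if len(datagram) < headerLen {
		return header{}, nil, errors.New("datagram shorter than header")
	}
	h := header{
		Seq:  binary.BigEndian.Uint32(datagram[0:4]),
		Ack:  binary.BigEndian.Uint32(datagram[4:8]),
		Nack: binary.BigEndian.Uint32(datagram[8:12]),
	}
	return h, datagram[headerLen:], nil
}

func main() {
	datagram := header{Seq: 7, Ack: 6, Nack: 4}.marshal([]byte("ip packet"))
	h, payload, err := unmarshal(datagram)
	fmt.Println(h, string(payload), err)
}
```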
The first algorithm to be implemented for UDP congestion control is based on TCP New Reno, a well understood and powerful congestion control protocol. RTT estimation is performed by applying $RTT_{AVG} = RTT_{AVG} \times (1-x) + RTT_{SAMPLE} \times x$ for each newly received packet, where $x$ is the weight given to the newest sample. Packet loss is detected in two ways: a negative acknowledgement, sent when a receiver receives a later packet than expected and has not received the preceding packet for $0.5 \times RTT$, and a sender timeout of $3 \times RTT$. The sender timeout exists to ensure that even if the only packet containing a NACK is dropped, the sender does not deadlock, though this case should be rare with a busy connection.
To achieve the same curve as New Reno, there are two phases: exponential growth and congestion avoidance. On flow start, using a technique known as slow start, the window size is increased by one for every packet that is acknowledged. When a packet loss is detected (using either of the two aforementioned methods), slow start ends, and the window size is halved. Now in congestion avoidance, the window size is increased by one for every full window of packets acknowledged without loss, instead of for each individual packet. When a packet loss is detected, the window size is halved, and congestion avoidance continues.
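The RTT average and the two loss timers derived from it can be sketched as follows. The type and method names are hypothetical, the weight of $0.125$ used in \verb'main' is borrowed from common TCP practice rather than taken from this project, and seeding the average with the first sample is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// rttEstimator keeps an exponentially weighted moving average of the RTT.
type rttEstimator struct {
	avg    time.Duration
	x      float64 // weight given to the newest sample
	primed bool    // whether a first sample has been taken
}

// sample folds a new measurement into the average:
// avg = avg*(1-x) + sample*x.
// Seeding the average with the first sample is an assumption of this sketch.
func (e *rttEstimator) sample(s time.Duration) time.Duration {
	if !e.primed {
		e.avg, e.primed = s, true
		return e.avg
	}
	e.avg = time.Duration(float64(e.avg)*(1-e.x) + float64(s)*e.x)
	return e.avg
}

// nackDelay is how long the receiver waits for a missing packet: 0.5 * RTT.
func (e *rttEstimator) nackDelay() time.Duration { return e.avg / 2 }

// senderTimeout is the sender's fallback loss timer: 3 * RTT.
func (e *rttEstimator) senderTimeout() time.Duration { return 3 * e.avg }

func main() {
	e := &rttEstimator{x: 0.125}
	for _, s := range []time.Duration{100 * time.Millisecond, 180 * time.Millisecond} {
		fmt.Println("average:", e.sample(s))
	}
	fmt.Println("NACK after:", e.nackDelay(), "timeout after:", e.senderTimeout())
}
```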
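The two phases can be captured in a small state machine, sketched below. The names are hypothetical, and the initial window of one packet and the floor of one packet are assumptions that the text does not specify.

```go
package main

import "fmt"

// window is a hypothetical sender-side congestion window, counted in packets.
type window struct {
	size      uint32
	slowStart bool
	acked     uint32 // packets acknowledged since the last increase
}

// newWindow starts in slow start. The initial size of one is an assumption.
func newWindow() *window {
	return &window{size: 1, slowStart: true}
}

// onAck grows the window: by one per acknowledged packet in slow start,
// and by one per full window of acknowledgements in congestion avoidance.
func (w *window) onAck() {
	if w.slowStart {
		w.size++
		return
	}
	w.acked++
	if w.acked >= w.size {
		w.size++
		w.acked = 0
	}
}

// onLoss halves the window and leaves slow start for good.
// The floor of one packet is an assumption of this sketch.
func (w *window) onLoss() {
	w.slowStart = false
	w.acked = 0
	w.size /= 2
	if w.size < 1 {
		w.size = 1
	}
}

func main() {
	w := newWindow()
	for i := 0; i < 7; i++ {
		w.onAck()
	}
	fmt.Println("after slow start:", w.size)
	w.onLoss()
	fmt.Println("after loss:", w.size)
	for i := 0; i < 4; i++ {
		w.onAck()
	}
	fmt.Println("after one full window of ACKs:", w.size)
}
```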
This section details the design decisions behind the application structure, and how it fits into the systems where it will be used. Much of the focus is on the flexibility of the interfaces to future additions, while also describing the concrete implementations available with the software as of this work.
The central structure for the operation of the software is the \verb'Proxy' struct. The proxy is defined by its source and sink, and provides methods for \verb'AddConsumer' and \verb'AddProducer'. The proxy coordinates the dispatching of sourced packets to consumers, and the delivery of produced packets to the sink. This follows the packet data path shown in figure \ref{fig:dataflow-overview}.
The proxy is implemented to take a consistent sink and source and accept consumers and producers that vary over the lifetime. This is due to the nature of producers and consumers, as each may be either ephemeral or persistent, depending on the configuration. An example is a device that accepts TCP connections and makes outbound UDP connections. In such a case, the TCP producers and consumers would be ephemeral, existing only until they are closed by the far side. The UDP producers and consumers are persistent, as control of reconnection is handled by this proxy. As the configuration is deliberately intended to be flexible, both of these can exist within the same proxy instance.
The structure of the proxy is built around the flow graph in figure \ref{fig:dataflow-overview}. The packet flow demonstrates the four transfers of data that occur within the software: packet source (TUN adapter) to source queue, source queue to consumer, producer to sink queue, and sink queue to packet sink (TUN adapter). The first and last of these exist once for an instance of the proxy. The others run once for each consumer or producer. The lifetime of each producer and consumer is controlled by the lifetime of its data flow loop, and each is only referenced within that loop, such that the garbage collector can collect any producers and consumers for which the loops have exited.
Finally, as mentioned above, the central proxy is able to restart consumers or producers that support it (those initiated by the proxy in question). Pseudocode for a consumer is shown in figure \ref{fig:proxy-loops-restart}. Whenever a producer or consumer terminates, and is found to be restartable, the application attempts to restart it until succeeding and re-entering the work loop.
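A Go sketch of such a restarting consumer loop is given below. The interface names \verb'Consumer' and \verb'Restartable', the fixed delay between attempts, and the \verb'flakyConsumer' used for demonstration are all hypothetical; the packet that triggered the failure is dropped rather than resent, in keeping with the proxy never retransmitting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Consumer and Restartable are hypothetical interfaces for this sketch.
type Consumer interface {
	Consume(packet []byte) error
}

type Restartable interface {
	Restart() error
}

// work forwards packets until the source closes (nil) or the consumer fails.
func work(c Consumer, source <-chan []byte) error {
	for packet := range source {
		if err := c.Consume(packet); err != nil {
			return err
		}
	}
	return nil
}

// consumeLoop runs the work loop and, if the consumer terminates and is
// restartable, retries the restart until it succeeds before re-entering.
func consumeLoop(c Consumer, source <-chan []byte, delay time.Duration) {
	for {
		if err := work(c, source); err == nil {
			return // source closed
		}
		r, ok := c.(Restartable)
		if !ok {
			return // ephemeral consumer: drop it
		}
		for r.Restart() != nil {
			time.Sleep(delay)
		}
	}
}

// flakyConsumer fails on its failAt-th call and needs two restart attempts.
type flakyConsumer struct {
	sent                    []string
	calls, failAt, restarts int
}

func (f *flakyConsumer) Consume(p []byte) error {
	f.calls++
	if f.calls == f.failAt {
		return errors.New("connection lost")
	}
	f.sent = append(f.sent, string(p))
	return nil
}

func (f *flakyConsumer) Restart() error {
	f.restarts++
	if f.restarts < 2 {
		return errors.New("peer unreachable")
	}
	return nil
}

func main() {
	source := make(chan []byte, 3)
	for _, p := range []string{"a", "b", "c"} {
		source <- []byte(p)
	}
	close(source)
	f := &flakyConsumer{failAt: 2}
	consumeLoop(f, source, time.Millisecond)
	fmt.Println(f.sent, "restart attempts:", f.restarts)
}
```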
The configuration format chosen was INI, extended with duplicate names. Included is a single Host section, followed by multiple Peer sections specific to a method of communicating with the other side of the proxy. Processing the configuration file is split into three parts: loading the configuration file into a Go struct, validating the configuration file, and building a proxy from the loaded configuration.
Validation of the configuration file is included to discover configuration errors prior to building an invalid proxy. Firstly, this ensures that no part of the program built from the configuration is given a value which is invalid in context and easily verifiable, such as a TCP port above 65,535. Secondly, catching errors in configuration before attempting to build the proxy constrains the errors of an invalid configuration to a single location. For a user, this might mean that an error such as \verb'Peer[1].LocalPort invalid: max 65535; given 74523' is shown, which explains the user's error more clearly than \verb'tcp: invalid address'.
Once a configuration is validated, the proxy is built. This is a simple case of creating the proxy from the given data and adding the producers and consumers needed for it to run, given that the configuration has already been validated. Whereas other packages function in terms of interfaces, the builder package ties together all of the pieces to produce a working proxy from the configuration.
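A validation pass of this kind might look like the following sketch. The struct and field names are hypothetical and only the port check is shown; the error text follows the example message quoted above.

```go
package main

import "fmt"

// maxPort is the largest valid TCP or UDP port number.
const maxPort = 65535

// Peer and Configuration are hypothetical stand-ins for the loaded
// configuration structs.
type Peer struct {
	LocalPort  uint
	RemotePort uint
}

type Configuration struct {
	Peers []Peer
}

// Validate reports the first out-of-range value, naming the field and
// the peer it belongs to so the user can find it in the file.
func (c Configuration) Validate() error {
	for i, p := range c.Peers {
		if p.LocalPort > maxPort {
			return fmt.Errorf("Peer[%d].LocalPort invalid: max %d; given %d", i, maxPort, p.LocalPort)
		}
		if p.RemotePort > maxPort {
			return fmt.Errorf("Peer[%d].RemotePort invalid: max %d; given %d", i, maxPort, p.RemotePort)
		}
	}
	return nil
}

func main() {
	c := Configuration{Peers: []Peer{{LocalPort: 1234}, {LocalPort: 74523}}}
	fmt.Println(c.Validate())
}
```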
% ------------------------- Sources and Sinks ------------------------------ %
\subsection{Sourcing and Sinking Packets}
Packets that wish to leave the software leave via a sink, and packets entering arrive via a source. As the application is developed in user space, the solution that is most flexible here is a TUN adapter. A TUN adapter provides a file-like interface to the layer 3 networking stack of a system.
Originally it was intended to use the Go library \verb'taptun' for TUN driver interaction, but this library lacked the platform compatibility that I was aiming for with this project. Fortunately, the \verb'wireguard-go' project has excellent compatibility for TUN adapters, and is released under the MIT licence. This allows me to rely on it as a library instead, increasing the software's compatibility significantly.
Initially, the application suffered from a significant race condition when starting. The application followed a standard flow, where it created a TUN adapter to receive IP packets and then began proxying the packets from/to it. However, when running the application, no notification was received when this TUN adapter became available. As such, any configuration completed on the TUN adapter was racing with the TUN adapter's creation, resulting in many start failures.
The software now runs in much the same way as other daemons, giving a launch experience similar to that of comparable applications. The primary inspiration for the functionality of the application is WireGuard \citep{donenfeld_wireguard_2017}, specifically \verb'wireguard-go'\footnote{\url{https://github.com/WireGuard/wireguard-go}}. To launch the application, the following shell command is used:
Firstly, the application validates the configuration, allowing an early exit if misconfigured. Then the TUN adapter is created. This TUN adapter and the configuration are handed to a duplicate of the process, which sees them and begins running the given proxy. This allows the parent process to exit, while the background process continues running as a daemon.
By exiting cleanly and running the proxy in the background, the race condition is avoided. The exit is a notice to the launcher that the TUN adapter is up and ready, allowing for further configuration steps to occur. Otherwise, an implementation specific signal would be necessary to allow the launcher of the application to move on, which conflicts with the requirement of easy future platform compatibility.
The integrated security solution of this software is in three parts: message authentication, repeat protection, and cryptographic exchanges. The interfaces for each of these and their implementations are described in this section.
\subsubsection{Message Authenticity Verification}
Message authentication is provided by a pair of interfaces, \verb'MacGenerator' and \verb'MacVerifier'. \verb'MacGenerator' provides a method which takes input data and produces a sum as output, while \verb'MacVerifier' confirms that the given sum is valid for the given data.
The provided implementation for message authenticity uses the BLAKE2s \citep{hutchison_blake2_2013} algorithm. By using library functions, the implementation is achieved simply by matching the interface provided by the library and the interface mentioned here. This ensures clarity, and reduces the likelihood of introducing a bug.
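The shape of this pair of interfaces can be sketched as below. The method names are hypothetical, and HMAC-SHA256 from the Go standard library is swapped in for BLAKE2s purely so that the example is self-contained; a keyed BLAKE2s hash from \verb'golang.org/x/crypto/blake2s' would slot into the same place.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"errors"
	"fmt"
)

// MacGenerator and MacVerifier follow the description above; the method
// names and signatures are assumptions of this sketch.
type MacGenerator interface {
	Generate(data []byte) []byte
}

type MacVerifier interface {
	Verify(data, sum []byte) error
}

// sharedKeyMAC implements both interfaces with a shared key.
// HMAC-SHA256 stands in for the BLAKE2s MAC used by the project.
type sharedKeyMAC struct {
	key []byte
}

func (m sharedKeyMAC) Generate(data []byte) []byte {
	h := hmac.New(sha256.New, m.key)
	h.Write(data)
	return h.Sum(nil)
}

func (m sharedKeyMAC) Verify(data, sum []byte) error {
	// hmac.Equal compares in constant time.
	if !hmac.Equal(m.Generate(data), sum) {
		return errors.New("message authentication failed")
	}
	return nil
}

func main() {
	var gen MacGenerator = sharedKeyMAC{key: []byte("shared secret")}
	var ver MacVerifier = sharedKeyMAC{key: []byte("shared secret")}

	data := []byte("proxied packet")
	sum := gen.Generate(data)
	fmt.Println("genuine:", ver.Verify(data, sum))
	fmt.Println("tampered:", ver.Verify([]byte("altered packet"), sum))
}
```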
Repeat protection takes advantage of the same two interfaces already mentioned. To allow this to be implemented, each consumer or producer takes an ordered list of \verb'MacGenerator's or \verb'MacVerifier's. When a packet is consumed, each of the generators is run in order, operating on the data of the last. When produced, this operation is completed in reverse, with each \verb'MacVerifier' stripping off the corresponding generator. An example of this is shown in figure \ref{fig:udp-packet-dataflow}. Firstly, the data sequence number is generated, before the MAC. When receiving the packet, the MAC is first stripped, before the data sequence number.
One difference with repeat protection is that it is shared between all producers and consumers. This is in contrast to message authenticity, which is thus far specific to a producer or consumer. The currently implemented repeat protection is that of \cite{tsou_ipsec_2012}. The code sample is provided with a BSD license, so is compatible with this project, and hence was simply adapted from C to Go. The repeat protection is created at a host level when building the proxy, and the same instance is shared amongst all producers, so it includes locking for thread safety.
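To illustrate the idea of a shared, locked repeat filter, the sketch below uses the classic bit-shifting sliding window rather than the shift-free algorithm of the cited work, as it is shorter to present. The names, the 64 packet window, and the lack of sequence number wraparound handling are all simplifications of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// windowSize is the number of recent sequence numbers remembered.
const windowSize = 64

// replayFilter is a hypothetical shared repeat filter. Bit i of bitmap
// is set when sequence number (latest - i) has already been accepted.
type replayFilter struct {
	mu     sync.Mutex
	latest uint32
	bitmap uint64
}

// Accept reports whether seq is new, recording it if so. It is safe to
// call from several producers at once.
func (f *replayFilter) Accept(seq uint32) bool {
	f.mu.Lock()
	defer f.mu.Unlock()

	switch {
	case seq > f.latest:
		// Newer than anything seen: slide the window forward.
		shift := seq - f.latest
		if shift >= windowSize {
			f.bitmap = 0
		} else {
			f.bitmap <<= shift
		}
		f.bitmap |= 1
		f.latest = seq
		return true
	case f.latest-seq >= windowSize:
		// Too old to judge, so rejected.
		return false
	default:
		bit := uint64(1) << (f.latest - seq)
		if f.bitmap&bit != 0 {
			return false // repeat
		}
		f.bitmap |= bit
		return true
	}
}

func main() {
	f := &replayFilter{}
	for _, seq := range []uint32{1, 3, 2, 3, 100, 3} {
		fmt.Println(seq, f.Accept(seq))
	}
}
```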
\subsubsection{Exchange}
The \verb'Exchange' interface provides for a cryptographic exchange, but is flexible enough to be used for other purposes too, as will be described for UDP congestion control. When beginning a flow, an ordered list of the \verb'Exchange' type is supplied. These exchanges are then performed in order until each has completed. If any exchange fails, the process returns to the beginning.
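The ordering and restart behaviour can be sketched as follows. The single-method \verb'Exchange' interface here is a simplification (a real exchange would send and receive messages), and the cap on attempts is an assumption added so that the sketch always terminates.

```go
package main

import (
	"errors"
	"fmt"
)

// Exchange is a simplified, hypothetical form of the interface: Perform
// runs one whole exchange and reports whether it succeeded.
type Exchange interface {
	Perform() error
}

// performAll runs each exchange in order, stopping at the first failure.
func performAll(exchanges []Exchange) error {
	for _, e := range exchanges {
		if err := e.Perform(); err != nil {
			return err
		}
	}
	return nil
}

// runExchanges returns to the first exchange whenever any exchange
// fails. The attempt limit is an assumption of this sketch.
func runExchanges(exchanges []Exchange, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if performAll(exchanges) == nil {
			return nil
		}
	}
	return errors.New("exchanges did not complete")
}

// scriptedExchange fails a set number of times before succeeding.
type scriptedExchange struct {
	failures int
	calls    int
}

func (s *scriptedExchange) Perform() error {
	s.calls++
	if s.failures > 0 {
		s.failures--
		return errors.New("exchange failed")
	}
	return nil
}

func main() {
	first := &scriptedExchange{}
	second := &scriptedExchange{failures: 1}
	err := runExchanges([]Exchange{first, second}, 3)
	fmt.Println(err, first.calls, second.calls)
}
```

In this run the second exchange fails once, so the first exchange is performed again before the list completes.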
Currently, no cryptographic exchange is necessary, as the methods mentioned above are symmetric. However, the exchange interface is taken advantage of when beginning a UDP flow. As UDP requires an initial exchange to initiate congestion control and establish a connection with the other node, this exchange interface is used. By pairing the congestion controller with the initial UDP exchange, the exchange interacts with the congestion controller, setting up the state correctly for the continuing connection.
This demonstrates the flexibility of combining the exchange interface with other objects. Although the software does not currently implement any key exchange algorithms, this is possible with the interfaces as described. Simply provide a type that implements both \verb'Exchange' and \verb'MacGenerator'. During the exchange, the keys needed for message authentication can be inserted directly into the structure, after which it will work for the lifetime of the consumer.
A directory tree of the repository is provided in figure \ref{fig:repository-structure}. The top level is split between \verb'code' and \verb'evaluation', where \verb'code' is compiled into the application binary, and \verb'evaluation' is used to verify the performance characteristics and generate graphs.
\begin{figure}
\dirtree{%
.1 /.
.2 code\DTcomment{Go code for the project}.
.3 config\DTcomment{Configuration management}.
.3 crypto\DTcomment{Cryptographic methods}.
.4 sharedkey\DTcomment{Shared key MACs}.
.3 mocks\DTcomment{Mocks to enable testing}.
.3 proxy\DTcomment{The central proxy controller}.
.3 shared\DTcomment{Shared errors}.
.3 tcp\DTcomment{TCP flow transport}.
.3 tun\DTcomment{TUN adapter}.
.3 udp\DTcomment{UDP datagram transport}.
.4 congestion\DTcomment{Congestion control methods}.
.3 utils\DTcomment{Common data structures}.
.2 evaluation\DTcomment{Result gathering and graph generation}.
.3 java\DTcomment{Java automated result gathering}.
}
\caption{Directory tree of the repository.}
\label{fig:repository-structure}
\end{figure}
The software portion of this proxy is entirely symmetric, as can be seen in figure \ref{fig:dataflow-overview}. However, the system configuration diverges, as each side of the proxy serves a different role. Referring to figure \ref{fig:dataflow-overview}, it can be seen that the kernel routing differs between the two nodes. Throughout, these two sides have been referred to as the local proxy and the remote proxy, with the local in the top left and the remote in the bottom right.
As the software portion of this application is implemented in user-space, it has no control over the routing of packets. Instead, a virtual interface is provided, and the kernel is instructed to route relevant packets to/from this interface. In sections \ref{section:implementation-remote-proxy-routing} and \ref{section:implementation-local-proxy-routing}, the configuration for routing packets on the remote proxy and local proxy respectively is explained. Finally, in section \ref{section:implementation-multi-interface-routing}, some potentially unexpected behaviour of using devices with multiple interfaces is discussed, such that the reader can avoid some of these pitfalls. Throughout this section, examples will be given for both Linux and FreeBSD. These examples each show only one of many methods of achieving the same results.
The common case for remote proxies is a cloud Virtual Private Server (VPS) with one public network interface. As such, some configuration is required to both proxy bidirectionally via that interface, and also use it for communication with the local proxy. Firstly, packet forwarding must be enabled for the device. On Linux this is achieved as follows:
These instruct the kernel in each case to forward packets. However, more instructions are necessary to ensure packets are routed correctly once forwarded. For the remote proxy, this involves two things: routing the communication for the proxy to the software side, and routing items necessary to the local system to the relevant application. Both of these are achieved in the same way, involving adjustments to the local routing table on Linux, and using \verb'pf(4)' rules on FreeBSD.
These settings combined will provide the proxying effect via the TUN interface configured in software. It is also likely worth firewalling much more aggressively at the remote proxy side, as dropping packets before saturating the low bandwidth connections between the local and remote proxy improves resilience to denial of service attacks. This can be completed either with similar routing and firewall rules to those above, or externally with many cloud providers.
Routing within the local proxy expects $1+N$ interfaces: one connected to the client device expecting the public IP, and $N$ connected to the wider Internet for communication with the other node. Referring to figure \ref{fig:dataflow-overview}, it can be seen that no complex rules are required to achieve this routing, as each interface serves a different role. As such, there are three goals: ensuring that packets for the remote IP are routed from the TUN to the client device and vice versa, ensuring that packets destined for the remote proxy are not routed to the client, and ensuring that each connection is routed via the correct WAN connection. The first two are covered in this section, with a discussion of the last in the next section.
Routing the packets from/for the local proxy is pleasantly easy. Firstly, enable IP forwarding for Linux or gateway mode for FreeBSD, as seen previously. Secondly, routes must be set up. Fortunately, these routes are far simpler than those for the remote proxy. The routing for the local proxy client interface is as follows on Linux:
Then, on the client device, simply set the IP address statically to the remote proxy address, and the gateway to \verb'192.168.1.1'. Now the local proxy can send and receive packets to the remote proxy, but some further routing rules are needed to ensure that the packets from the proxy reach the remote proxy, and that forwarding works correctly. This falls to routing tables and \verb'pf(4)', so for Linux:
These rules achieve both the listed criteria, of communicating with the remote proxy while also forwarding the packets necessary to the client. The local proxy can be extended with more functionality, such as NAT and DHCP. This allows plug and play for the client, while also allowing multiple clients to take advantage of the connection without another router present.
During testing, I discovered behaviour that I found surprising when it came to multi-homed hosts. Here I will detail some of this behaviour, and workarounds found to enable the software to still work well regardless.
The first piece of surprising behaviour comes from a device which has multiple interfaces lying on the same subnet. Consider a device with two Ethernet interfaces, each of which gains a DHCP IPv4 address from the same network. The first interface \verb'eth0' takes the IP \verb'10.10.0.2' and the second \verb'eth1' takes the IP \verb'10.10.0.3', each with a subnet mask of \verb'/24'. If a packet originates from userspace with source address \verb'10.10.0.2' and destination address \verb'10.10.0.1', it may leave via either \verb'eth0' or \verb'eth1'. I initially found this behaviour very surprising, as it seems clear that the packet should be sent from \verb'eth0', as that is the interface which has the given IP. However, as the route is chosen by the destination address, and both interfaces hold a route to that subnet, either interface matches.
Although this may seem like a contrived use case, consider this: a dual WAN router lies in front of a server, which uses these two interfaces to take two IPs. Policy routing is used on the dual WAN router to allow this device control over choice of WAN, by using either of its LAN IPs. In this case, this default routing would mean that the userspace software has no control over the WAN, as one will be selected seemingly arbitrarily. The solution to this problem is manipulation of routing tables. By creating a high priority routing table for each interface, and routing packets more specifically than the default routes, the correct packets can be routed outbound via the correct interface.
The second issue follows a similar theme of IP addresses being owned by the host and not the interface which has that IP set, as Linux hosts respond to ARP requests for any of their IP addresses on all interfaces by default. This problem is known as ARP flux. Going back to our prior example of \verb'eth0' and \verb'eth1' on the same subnet, ARP flux means that if another host sends packets to \verb'10.10.0.2', they may arrive at either \verb'eth0' or \verb'eth1', and this changes with time. Once again, this is rather contrived, but it also means that, for example, ARP requests for a private VPN IP will be answered on the LAN a computer is on. Although this is desirable in some cases, it continues to seem like surprising default behaviour. The solution to this is also simple: a pair of kernel parameters, set by the following, resolves the issue.
The final discovery I made is that many of these problems can be solved by changing the question. In my real world testing, explained in section \ref{section:real-world-testing}, the local proxy lies behind a dual WAN router. This router allows the same port to be accessible via two WAN IPs, and avoids any routing complication as the router itself handles the NAT perfectly. Prior to this I was attempting to route outbound, similar to the situation described above, with some difficulty. Hence it is worth considering whether an architecture modification can make the routing simpler for the task you are trying to achieve.