Implementation of the proxy is in two parts: software that provides a multipath layer 3 tunnel between two hosts, and the system configuration necessary to proxy as described. An overview of the system is presented in figure 3.1.
This chapter will detail this implementation in three sections. The software will be described in sections \ref{section:implementation-software-structure} and \ref{section:implementation-producer-consumer}. Section \ref{section:implementation-software-structure} explains the software's structure and dataflow. Section \ref{section:implementation-producer-consumer} details the implementation of both TCP and UDP methods of transporting the tunnelled packets between the hosts. The system configuration will be described in section \ref{section:implementation-system-configuration}, along with a discussion of some of the oddities of multipath routing, such that a reader would have enough knowledge to implement the proxy.
The central structure for the operation of the software is the \verb'Proxy' struct. The proxy is defined by its source and sink, and provides methods for \verb'AddConsumer' and \verb'AddProducer'. The proxy coordinates the dispatching of sourced packets to consumers, and the delivery of produced packets to the sink. This follows the packet data path shown in figure \ref{fig:proxy-start-data-flow}.
The proxy is implemented to take a consistent sink and source and accept consumers and producers that vary over the lifetime. This is due to the nature of producers and consumers, as each may be either ephemeral or persistent, depending on the configuration. An example is a device that accepts TCP connections and makes outbound UDP connections. In such a case, the TCP producers and consumers would be ephemeral, existing only until they are closed by the far side. The UDP producers and consumers are persistent, as control of reconnection is handled by this proxy. As the configuration is deliberately intended to be flexible, both of these can exist within the same proxy instance.
The structure of the proxy is built around the flow graph in figure \ref{fig:proxy-start-data-flow}. The flow graph demonstrates the four transfers of data that occur: packet source to source queue, source queue to consumer, producer to sink queue, and sink queue to packet sink. For the former and latter, these exist once for an instance of the proxy. The others run once for each consumer or producer. Basic examples of the logic applied for each flow are given in figure \ref{fig:proxy-loops}.
Although the pseudocode given in figure \ref{fig:proxy-loops} is very simple, this is how the Go code is implemented, aside from error handling. Go's cooperative scheduler and lightweight Goroutines make this an efficient implementation. However, given that the expected number of simultaneously connected consumers and producers is low, heavier OS threads would also be effective here. The queues are also trivial to implement in Go, as channels provide all of the necessary functionality, though they can be implemented in other languages too. The lifetime of each producer or consumer is controlled by the lifetime of the aforementioned loops, and each is only referenced within them, such that the garbage collector can collect any producers and consumers for which the loops have exited.
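As an illustration of how the consumer and producer loops map onto channels and Goroutines, a minimal sketch follows. Only \verb'Proxy', \verb'AddConsumer' and \verb'AddProducer' are names taken from the text; the \verb'Packet' type, the method signatures, the queue depth and the in-memory \verb'pipe' transport are assumptions made for this example, and error handling is reduced to ending the loop.

\begin{minted}{go}
package main

import (
    "errors"
    "fmt"
)

// Packet is a raw layer 3 packet (hypothetical type for this sketch).
type Packet []byte

// Consumer sends packets to the far side; Producer receives them.
// The method signatures are assumptions.
type Consumer interface{ Consume(Packet) error }
type Producer interface{ Produce() (Packet, error) }

// Proxy holds the source queue and the sink queue as buffered channels.
type Proxy struct {
    sourceQueue chan Packet
    sinkQueue   chan Packet
}

func NewProxy(depth int) *Proxy {
    return &Proxy{
        sourceQueue: make(chan Packet, depth),
        sinkQueue:   make(chan Packet, depth),
    }
}

// AddConsumer starts a loop moving packets from the source queue to c.
// The loop holds the only reference to c, so c can be collected on exit.
func (p *Proxy) AddConsumer(c Consumer) {
    go func() {
        for pkt := range p.sourceQueue {
            if err := c.Consume(pkt); err != nil {
                return
            }
        }
    }()
}

// AddProducer starts a loop moving packets from pr to the sink queue.
func (p *Proxy) AddProducer(pr Producer) {
    go func() {
        for {
            pkt, err := pr.Produce()
            if err != nil {
                return
            }
            p.sinkQueue <- pkt
        }
    }()
}

// pipe is an in-memory transport standing in for a TCP or UDP flow.
type pipe chan Packet

func (c pipe) Consume(pkt Packet) error { c <- pkt; return nil }
func (c pipe) Produce() (Packet, error) {
    pkt, ok := <-c
    if !ok {
        return nil, errors.New("pipe closed")
    }
    return pkt, nil
}

func main() {
    local, remote := NewProxy(8), NewProxy(8)
    link := make(pipe, 8)
    local.AddConsumer(link)
    remote.AddProducer(link)

    local.sourceQueue <- Packet("hello")
    fmt.Printf("%s\n", <-remote.sinkQueue)
}
\end{minted}

The two remaining flows, from packet source to source queue and from sink queue to packet sink, would follow the same pattern with one Goroutine each per proxy instance.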
Finally, the central proxy is able to restart consumers or producers that support it (thus far, those initiated by the proxy in question), as mentioned previously. This wraps the loops shown in figure \ref{fig:proxy-loops} in an additional layer. Pseudocode for the expanded consumer loop is shown in figure \ref{fig:proxy-loops-restart}, with the producers being expanded similarly.
\begin{figure}
\begin{minted}{python}
do:
    while is_reconnectable(consumer) and not is_alive(consumer):
        reconnect(consumer)
    while is_alive(consumer):
        packet = source_queue.popOrBlock()
        consumer.consume(packet)
while is_reconnectable(consumer)
\end{minted}
\caption{Pseudocode for a consumer, supporting reconnection.}
\label{fig:proxy-loops-restart}
\end{figure}
The configuration format chosen was INI, extended with duplicate names. Included is a single Host section, followed by multiple Peer sections specific to a method of communicating with the other side. Processing the configuration file is split into three parts: loading the configuration file into a Go struct, validating the configuration file, and building a proxy from the loaded configuration.
Validation of the configuration file is included to discover configuration errors prior to building an invalid proxy. Firstly, this ensures that no part of the program built from the configuration is given a value that is invalid in context and easily verifiable as such, for example a TCP port above 65,535. Secondly, catching errors in configuration before attempting to build the proxy constrains the errors of an invalid configuration to a single location. For a user, this might mean that an error such as \verb'Peer[1].LocalPort invalid: max 65535; given 74523' is shown, as opposed to \verb'tcp: invalid address'. The former points to the user's mistake, while the latter could be mistaken for a bug in the code.
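A minimal sketch of such a validation step is given below. The \verb'Peer' struct is reduced to the single \verb'LocalPort' field named in the error message above, and the \verb'Validate' method and its signature are assumptions rather than the project's actual API.

\begin{minted}{go}
package main

import "fmt"

// Peer mirrors one Peer section of the configuration. Only LocalPort is
// taken from the text; a real section would carry more fields.
type Peer struct {
    LocalPort int
}

// Validate reports the first invalid field, naming it by its position in
// the configuration so that the user can find it.
func (p Peer) Validate(index int) error {
    if p.LocalPort < 0 {
        return fmt.Errorf("Peer[%d].LocalPort invalid: min 0; given %d",
            index, p.LocalPort)
    }
    if p.LocalPort > 65535 {
        return fmt.Errorf("Peer[%d].LocalPort invalid: max 65535; given %d",
            index, p.LocalPort)
    }
    return nil
}

func main() {
    peers := []Peer{{LocalPort: 4000}, {LocalPort: 74523}}
    for i, p := range peers {
        if err := p.Validate(i); err != nil {
            fmt.Println(err)
        }
    }
}
\end{minted}

Running every check before any socket is opened keeps all configuration errors in one place, which is the property argued for above.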
Once a configuration is validated, the proxy is built. This is a simple case of creating the proxy from the given data and adding the producers and consumers for its successful running.
This builder structure is also useful for a Go project, as it helps avoid circular imports, which are banned in Go. An example is a TCP flow implementing \verb'proxy.Consumer'. The proxy package cannot import TCP to create flows, so this must be delegated to another package. A builder package can bridge the gap, while maintaining a close link between the configuration and a single place where it is built.
% ---------------------- Running the Application --------------------------- %
\subsection{Running the Application}
The software is designed to run in much the same way as other daemons, giving a similar experience to other applications. The primary inspiration for the functionality of the application is WireGuard \citep{donenfeld_wireguard_2017}, specifically \verb'wireguard-go'\footnote{\url{https://github.com/WireGuard/wireguard-go}}. To launch the application, the following shell command is used:
\begin{minted}{shell-session}
netcombiner nc0
\end{minted}
When correctly configured, after a short pause this command exits with status 0. The system then has an interface named \verb'nc0' which provides a tunnel as configured. To achieve this in a C application, a fork is used. However, the language Go is incompatible with forking, so instead a second process is spawned. Spawning a process and dying is convenient here for two reasons: if a user launches the application it leaves their shell clear, and if an init system launches the application it knows that the TUN is available as soon as the application exits.
To exit cleanly after opening the TUN adapter, the application hands over control of the adapter to a second process. To achieve this, a duplicate process is spawned with the TUN's file descriptor in slot 3, and an environment variable stating as such. When the application is started it checks for the presence of this variable, and if it finds it, knows that it is the child process. Once this process is spawned, the parent process may exit, as the child process now controls the TUN adapter and can continue the work of the proxy.
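The handover can be sketched with the standard library's \verb'os/exec' package, where the first entry of \verb'ExtraFiles' becomes file descriptor 3 in the child. The environment variable name \verb'NC_TUN_FD' and the function names are hypothetical, and the sketch stops short of opening a real TUN adapter. \verb'ExtraFiles' is not supported on Windows, so this applies to Unix-like systems only.

\begin{minted}{go}
package main

import (
    "fmt"
    "os"
    "os/exec"
)

// childEnv is a hypothetical variable name marking the child process.
const childEnv = "NC_TUN_FD"

// isChild reports whether this process was handed an open TUN descriptor.
func isChild(getenv func(string) string) bool {
    return getenv(childEnv) == "3"
}

// spawnChild re-executes the current binary, passing tun as descriptor 3.
// Entry i of exec.Cmd.ExtraFiles becomes descriptor 3+i in the child.
func spawnChild(tun *os.File, args []string) error {
    exe, err := os.Executable()
    if err != nil {
        return err
    }
    cmd := exec.Command(exe, args...)
    cmd.ExtraFiles = []*os.File{tun}
    cmd.Env = append(os.Environ(), childEnv+"=3")
    // Start does not wait for the child, so the parent may now exit.
    return cmd.Start()
}

func main() {
    if isChild(os.Getenv) {
        tun := os.NewFile(3, "tun")
        fmt.Println("child: continuing with inherited descriptor", tun.Fd())
        return
    }
    fmt.Println("parent: would open the TUN adapter, call spawnChild, exit 0")
}
\end{minted}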
The application expects to find configuration in well known locations, \verb'/etc/netcombiner/%IF' on Linux and \verb'/usr/local/etc/netcombiner/%IF' on FreeBSD. However, the application also supports being provided with a configuration file location on the command line, ensuring flexibility with other systems, such as router operating systems based on Linux or FreeBSD, without code changes.
Packets that wish to leave the software leave via a sink, and packets entering arrive via a source. As the application is developed in user space, the solution that is most flexible here is a TUN adapter. A TUN adapter provides a file like interface to the layer 3 networking stack of a system.
Originally it was intended to use the Go library \verb'taptun' for TUN driver interaction, but this library ended up lacking the platform compatibility that I was aiming for with this project. Fortunately, the \verb'wireguard-go' project has excellent compatibility for TUN adapters, and is licensed under the MIT license. This allows me to instead rely on this as a library, increasing the software's compatibility significantly.
The integrated security solution of this software is in three parts: message authentication, repeat protection, and cryptographic exchanges. The interfaces for each of these and their implementations are described in this section.
\subsubsection{Message Authenticity Verification}
Message authentication is provided by a pair of interfaces, \verb'MacGenerator' and \verb'MacVerifier'. \verb'MacGenerator' provides a method which takes input data and produces a sum as output, while \verb'MacVerifier' confirms that the given sum is valid for the given data.
The provided implementation for message authenticity uses BLAKE2s \citep{hutchison_blake2_2013}. By using library functions, the implementation is achieved simply by matching the interface provided by the library and the interface mentioned here. This ensures clarity, and reduces the likelihood of introducing a bug.
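The shape of this pair of interfaces can be sketched as follows. The method names and signatures are assumptions. To keep the example within the Go standard library, HMAC-SHA256 is substituted for keyed BLAKE2s; it is not the algorithm used by the project.

\begin{minted}{go}
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "errors"
    "fmt"
)

// MacGenerator produces a sum for the given data (assumed signature).
type MacGenerator interface {
    Generate(data []byte) []byte
}

// MacVerifier confirms that sum is valid for data (assumed signature).
type MacVerifier interface {
    Verify(data, sum []byte) error
}

// hmacSHA256 implements both interfaces with a shared key. It stands in
// for keyed BLAKE2s so that the sketch needs no external module.
type hmacSHA256 struct{ key []byte }

func (h hmacSHA256) Generate(data []byte) []byte {
    m := hmac.New(sha256.New, h.key)
    m.Write(data)
    return m.Sum(nil)
}

func (h hmacSHA256) Verify(data, sum []byte) error {
    // hmac.Equal compares in constant time.
    if !hmac.Equal(h.Generate(data), sum) {
        return errors.New("mac: sum does not match data")
    }
    return nil
}

func main() {
    var gen MacGenerator = hmacSHA256{key: []byte("shared secret")}
    var ver MacVerifier = hmacSHA256{key: []byte("shared secret")}

    packet := []byte("tunnelled packet")
    sum := gen.Generate(packet)
    fmt.Println(ver.Verify(packet, sum))
    fmt.Println(ver.Verify([]byte("tampered packet"), sum))
}
\end{minted}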
Repeat protection takes advantage of the same two interfaces already mentioned. To allow this to be implemented, each consumer or producer takes an ordered list of \verb'MacGenerator's or \verb'MacVerifier's. When a packet is consumed, each of the generators is run in order, operating on the data of the last. When produced, this operation is completed in reverse, with each \verb'MacVerifier' stripping off the corresponding generator. An example of this is shown in figure \ref{fig:udp-packet-dataflow}. Firstly, the data sequence number is generated, before the MAC. When receiving the packet, the MAC is first stripped, before the data sequence number.
One difference with repeat protection is that it is shared between all producers and consumers. This is in contrast to message authenticity, which is thus far specific to a producer or consumer. The currently implemented repeat protection is that of \cite{tsou_ipsec_2012}. The code sample is provided with a BSD license, so is compatible with this project, and hence was simply adapted from C to Go. The repeat protection is created at a host level when building the proxy, and the same instance is shared amongst all producers, so it includes locking for thread safety.
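To illustrate the idea of a shared, locked repeat filter, a simplified sliding window is sketched below. This is not the cited algorithm, which is designed to avoid bit shifting; it is a plain 64-entry bitmap window, and the type and method names are assumptions.

\begin{minted}{go}
package main

import (
    "fmt"
    "sync"
)

const windowSize = 64

// ReplayFilter accepts each sequence number at most once, rejecting those
// that are repeated or older than the window. The mutex allows a single
// instance to be shared between producers.
type ReplayFilter struct {
    mu      sync.Mutex
    started bool
    highest uint64 // highest sequence number accepted so far
    bitmap  uint64 // bit i set means (highest - i) has been seen
}

func (f *ReplayFilter) Accept(seq uint64) bool {
    f.mu.Lock()
    defer f.mu.Unlock()

    switch {
    case !f.started:
        f.started, f.highest, f.bitmap = true, seq, 1
        return true
    case seq > f.highest:
        shift := seq - f.highest
        if shift >= windowSize {
            f.bitmap = 0
        } else {
            f.bitmap <<= shift
        }
        f.bitmap |= 1
        f.highest = seq
        return true
    case f.highest-seq >= windowSize:
        return false // too old to track
    default:
        bit := uint64(1) << (f.highest - seq)
        if f.bitmap&bit != 0 {
            return false // already seen
        }
        f.bitmap |= bit
        return true
    }
}

func main() {
    var f ReplayFilter
    for _, seq := range []uint64{1, 1, 3, 2, 2} {
        fmt.Println(seq, f.Accept(seq))
    }
}
\end{minted}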
\subsubsection{Exchange}
The \verb'Exchange' interface provides for a cryptographic exchange, but is flexible enough to be used for other purposes too, as will be described for UDP congestion control. When beginning a flow, an ordered list of the \verb'Exchange' type is supplied. These exchanges are then performed in order until each has completed. If any exchange fails, the process returns to the beginning.
Currently, no cryptographic exchange is necessary, as the methods mentioned above are symmetric. However, the exchange interface is taken advantage of when beginning a UDP flow. As UDP requires an initial exchange to initiate congestion control and establish a connection with the other node, this exchange interface is used. By pairing the congestion controller with the initial UDP exchange, the exchange interacts with the congestion controller, setting up the state correctly for the continuing connection.
This demonstrates the flexibility of combining the exchange interface with other objects. Although the software does not currently implement any key exchange algorithms, this is possible with the interfaces as described. Simply provide a type that implements both \verb'Exchange' and \verb'MacGenerator'. During the exchange, the keys needed for message authentication can be inserted directly into the structure, after which it will work as intended.
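The ordered, restart-on-failure behaviour of exchanges can be sketched as below. The single-method \verb'Exchange' interface is an assumption, as a real exchange would also need to send and receive messages, and the bound on attempts is added only so that the example terminates.

\begin{minted}{go}
package main

import (
    "errors"
    "fmt"
)

// Exchange is one step performed when a flow begins (assumed shape).
type Exchange interface {
    Perform() error
}

// runExchanges performs each exchange in order, returning to the first
// whenever one fails. attempts bounds the retries in this sketch.
func runExchanges(exchanges []Exchange, attempts int) error {
    var err error
    for try := 0; try < attempts; try++ {
        err = nil
        for _, e := range exchanges {
            if err = e.Perform(); err != nil {
                break
            }
        }
        if err == nil {
            return nil
        }
    }
    return fmt.Errorf("exchanges incomplete after %d attempts: %v",
        attempts, err)
}

// flaky fails a fixed number of times before succeeding.
type flaky struct{ failures, calls int }

func (f *flaky) Perform() error {
    f.calls++
    if f.calls <= f.failures {
        return errors.New("exchange failed")
    }
    return nil
}

func main() {
    first, second := &flaky{}, &flaky{failures: 1}
    err := runExchanges([]Exchange{first, second}, 3)
    // The first exchange runs twice, as the failure of the second
    // restarts the whole list.
    fmt.Println(err, first.calls, second.calls)
}
\end{minted}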
A directory tree of the repository is provided in figure \ref{fig:repository-structure}. The top level is split between \verb'code' and \verb'evaluation', where \verb'code' is compiled into the application binary, and \verb'evaluation' is used to verify the performance characteristics and generate graphs.
As shown in figure \ref{fig:dataflow-overview} and described in section \ref{section:implementation-software-structure}, the interfaces through which transport for packets is provided between the two hosts are producers and consumers. A transport pair is then created between a consumer on one host and a producer on the other, where packets enter the consumer and exit the corresponding producer. Two methods for producers and consumers are implemented: TCP and UDP. As the greedy load balancing of this proxy relies on congestion control, TCP provided a base for a proof of concept, while UDP expands on this proof of concept to produce a usable solution. This section discusses, in section \ref{section:implementation-tcp}, the method of transporting discrete packets across the continuous byte stream of a TCP flow. Then, in section \ref{section:implementation-udp}, it goes on to discuss adding congestion control to UDP datagrams, while avoiding retransmissions.
The base implementation for producers and consumers takes advantage of TCP. The requirements for the load balancing given above to function are simple: flow control and congestion control. TCP provides both of these, so was an obvious initial solution. However, TCP also provides unnecessary overhead, which will go on to be discussed further.
TCP is a stream oriented connection, while the packets to be sent are discrete datagrams. That is, a TCP flow cannot be connected directly to a TUN adapter, as the TUN adapter expects discrete and formatted IP packets while the TCP connection sends a stream of bytes. To resolve this, each packet sent across a TCP flow is prefixed with the length of the packet. On the sending side, this involves writing the 32-bit length of the packet, followed by the packet itself. For the receiver, first 4 bytes are read to recover the length of the next packet, after which that many bytes are read. This successfully punctuates the stream oriented connection into a packet based connection.
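The framing described above can be sketched with \verb'encoding/binary' and \verb'io.ReadFull'. The big-endian byte order, the function names and the maximum packet size check are assumptions of this example, and a \verb'bytes.Buffer' stands in for the TCP connection.

\begin{minted}{go}
package main

import (
    "bytes"
    "encoding/binary"
    "errors"
    "fmt"
    "io"
)

// maxPacket guards against allocating for a corrupt length prefix.
const maxPacket = 65535

// writePacket prefixes the packet with its length as a 32-bit integer.
func writePacket(w io.Writer, packet []byte) error {
    var prefix [4]byte
    binary.BigEndian.PutUint32(prefix[:], uint32(len(packet)))
    if _, err := w.Write(prefix[:]); err != nil {
        return err
    }
    _, err := w.Write(packet)
    return err
}

// readPacket reads one length prefix, then exactly that many bytes.
// io.ReadFull keeps reading until the buffer is full, so short reads
// from the stream cannot split a packet.
func readPacket(r io.Reader) ([]byte, error) {
    var prefix [4]byte
    if _, err := io.ReadFull(r, prefix[:]); err != nil {
        return nil, err
    }
    n := binary.BigEndian.Uint32(prefix[:])
    if n > maxPacket {
        return nil, errors.New("framing: length prefix too large")
    }
    packet := make([]byte, n)
    if _, err := io.ReadFull(r, packet); err != nil {
        return nil, err
    }
    return packet, nil
}

// roundTrip frames packets onto an in-memory stream and reads them back.
func roundTrip(packets ...[]byte) ([][]byte, error) {
    var stream bytes.Buffer // stands in for a net.Conn
    for _, p := range packets {
        if err := writePacket(&stream, p); err != nil {
            return nil, err
        }
    }
    var out [][]byte
    for range packets {
        p, err := readPacket(&stream)
        if err != nil {
            return nil, err
        }
        out = append(out, p)
    }
    return out, nil
}

func main() {
    out, err := roundTrip([]byte("first"), []byte("second packet"))
    if err != nil {
        fmt.Println(err)
        return
    }
    for _, p := range out {
        fmt.Printf("%d bytes: %s\n", len(p), p)
    }
}
\end{minted}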
However, using TCP to tunnel TCP packets (known as TCP-over-TCP) can cause a degradation in performance in non-ideal circumstances \citep{honda_understanding_2005}. Further, using TCP to tunnel IP packets provides a superset of the required guarantees, in that reliable delivery and ordering are guaranteed. Reliable delivery can cause a decrease in performance for tunnelled flows which do not require reliable delivery, such as a live video stream - a live stream does not wish to wait for a packet to be redelivered from a portion that is already played, and thus will spend longer buffering than if it received the up to date packets instead. Ordering can limit performance when tunnelling multiple streams, as a packet for a phone call could already be received, but instead has to wait in a buffer for a packet for a download to arrive, increasing latency unnecessarily.
Although the TCP implementation provides an excellent proof of concept and basic implementation, work moved to a second UDP implementation, aiming to solve some of these problems. However, the TCP implementation is functionally correct, so is left as an option, furthering the idea of flexibility maintained throughout this project. In cases where a connection that suffers particularly high packet loss is combined with one which is more stable, TCP could be employed on the high loss connection to limit overall packet loss. The effectiveness of such a solution would be implementation specific, so is left for the architect to decide.
To resolve the issues seen with TCP, an implementation using UDP was built as an alternative. UDP differs from TCP in that it provides almost no guarantees, and is based on sending discrete datagrams as opposed to a stream of bytes. However, UDP datagrams do not provide the congestion control or flow control required, so this must be built on top of the protocol. As the flow itself can be managed in userspace, as opposed to the TCP flow which is managed in kernel space, more flexibility is available in implementation. This allows received packets to be immediately dispatched, with little regard for ordering.
Congestion control is most commonly applied in the context of reliable delivery. This provides a significant benefit to TCP congestion control protocols: cumulative acknowledgements. As all of the bytes should always arrive eventually, unless the connection has faulted, the acknowledgement number (ACK) can simply be set to the highest received byte. Therefore, some adaptations are necessary for TCP congestion control algorithms to apply in an unreliable context. Firstly, for a packet based connection, ACKing specific bytes makes little sense - a packet is atomic, and is lost as a whole unit. To account for this, sequence numbers and their respective acknowledgements will be for entire packets, as opposed to per byte. Secondly, for an unreliable protocol, cumulative acknowledgements are not as simple. As packets are now allowed to never arrive within the correct function of the flow, a situation where a packet is never received would cause deadlock with an ACK that is simply set to the highest received sequence number, demonstrated in figure \ref{fig:sequence-ack-discontinuous}. Neither side can progress once the window is full, as the sender will not receive an ACK to free up space within the window, and the receiver will not receive the missing packet to increase the ACK.
I present a solution based on Negative Acknowledgements (NACKs). When the receiver believes that it will never receive a packet, it increases the NACK to the highest missing sequence number, and sets the ACK to one above the NACK. The ACK algorithm is then performed to grow the ACK as high as possible. This is simplified to any change in NACK representing at least one lost packet, which can be used by the specific congestion control algorithms to react. Though this usage of the NACK appears to provide a close approximation to ACKs on reliable delivery, the choice of how to use the ACK and NACK fields is delegated to the congestion controller implementation, allowing for different implementations if they better suit the method of congestion control.
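One possible reading of this receiver behaviour is sketched below. The \verb'Receiver' type and its methods are hypothetical, sequence numbers are assumed to start at 1, wraparound is ignored, and the decision of when to give up on a packet is left to the caller.

\begin{minted}{go}
package main

import "fmt"

// Receiver tracks arrivals in order to compute the ACK and NACK fields.
type Receiver struct {
    buffered map[uint32]bool // arrivals above the current ACK
    ack      uint32          // all of (nack, ack] has arrived
    nack     uint32          // highest sequence number given up on
}

func NewReceiver() *Receiver {
    return &Receiver{buffered: map[uint32]bool{}}
}

// grow advances the ACK across any contiguous run of arrivals.
func (r *Receiver) grow() {
    for r.buffered[r.ack+1] {
        delete(r.buffered, r.ack+1)
        r.ack++
    }
}

// Receive records the arrival of one packet.
func (r *Receiver) Receive(seq uint32) {
    if seq > r.ack {
        r.buffered[seq] = true
    }
    r.grow()
}

// GiveUp abandons the gap below the lowest buffered arrival: the NACK
// moves to the highest missing number, the ACK to one above it, and the
// ACK then grows as far as possible.
func (r *Receiver) GiveUp() {
    lowest, found := uint32(0), false
    for seq := range r.buffered {
        if !found || seq < lowest {
            lowest, found = seq, true
        }
    }
    if !found {
        return // nothing arrived beyond the ACK, so no gap is known
    }
    r.nack = lowest - 1
    r.ack = lowest
    delete(r.buffered, lowest)
    r.grow()
}

func main() {
    r := NewReceiver()
    for _, seq := range []uint32{1, 2, 5, 6} {
        r.Receive(seq)
    }
    fmt.Println("before:", r.ack, r.nack) // packets 3 and 4 are missing
    r.GiveUp()
    fmt.Println("after: ", r.ack, r.nack)
}
\end{minted}

In this reading, a sender that sees the NACK change learns that at least one packet was lost, while the ACK continues to free space in its window.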
Given the decision to use ACKs and NACKs, the packet structure for UDP datagrams can now be designed. The chosen structure is given in figure \ref{fig:udp-packet-structure}. The congestion control header consists of the sequence number and the ACK and NACK, each 32-bit unsigned integers.
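A sketch of encoding and decoding such a header follows. The field order (ACK, NACK, then sequence number) and the big-endian byte order are assumptions made for illustration; figure \ref{fig:udp-packet-structure} defines the actual layout.

\begin{minted}{go}
package main

import (
    "encoding/binary"
    "errors"
    "fmt"
)

// Header is the congestion control header: three 32-bit unsigned integers.
type Header struct {
    Ack, Nack, Seq uint32
}

const headerLen = 12

// Marshal encodes the header. Field and byte order are assumptions.
func (h Header) Marshal() []byte {
    b := make([]byte, headerLen)
    binary.BigEndian.PutUint32(b[0:4], h.Ack)
    binary.BigEndian.PutUint32(b[4:8], h.Nack)
    binary.BigEndian.PutUint32(b[8:12], h.Seq)
    return b
}

// Unmarshal splits a datagram into its header and the remaining bytes.
func Unmarshal(b []byte) (Header, []byte, error) {
    if len(b) < headerLen {
        return Header{}, nil, errors.New("datagram shorter than header")
    }
    h := Header{
        Ack:  binary.BigEndian.Uint32(b[0:4]),
        Nack: binary.BigEndian.Uint32(b[4:8]),
        Seq:  binary.BigEndian.Uint32(b[8:12]),
    }
    return h, b[headerLen:], nil
}

func main() {
    datagram := append(Header{Ack: 6, Nack: 4, Seq: 9}.Marshal(), "data"...)
    h, rest, err := Unmarshal(datagram)
    fmt.Println(h, string(rest), err)
}
\end{minted}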
The first algorithm to be implemented for UDP Congestion Control is based on TCP New Reno. TCP New Reno is a well understood and powerful congestion control protocol. RTT estimation is performed by applying $RTT_{AVG}= RTT_{AVG}*(1-x)+ RTT_{SAMPLE}*x$ for each newly received packet. Packet loss is measured in two ways: negative acknowledgements when a receiver receives a later packet than expected and has not received the preceding for $0.5*RTT$, and a sender timeout of $3*RTT$. The sender timeout exists to ensure that even if the only packet containing a NACK is dropped, the sender does not deadlock, though this case should be rare with a busy connection.
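The RTT update rule and the two loss timers can be sketched as follows. The weight $x = 0.125$ is an assumption borrowed from the conventional TCP smoothing factor, and seeding the average with the first sample is a choice made for this example; the type and method names are hypothetical.

\begin{minted}{go}
package main

import (
    "fmt"
    "time"
)

// rttEstimator keeps the weighted moving average RTT_AVG.
type rttEstimator struct {
    avg    time.Duration
    weight float64 // x in the update rule; the value is an assumption
}

// sample applies RTT_AVG = RTT_AVG*(1-x) + RTT_SAMPLE*x.
func (e *rttEstimator) sample(rtt time.Duration) {
    if e.avg == 0 {
        e.avg = rtt // seed with the first sample
        return
    }
    e.avg = time.Duration(float64(e.avg)*(1-e.weight) +
        float64(rtt)*e.weight)
}

// nackDelay is how long the receiver waits for a missing packet: 0.5*RTT.
func (e *rttEstimator) nackDelay() time.Duration { return e.avg / 2 }

// senderTimeout is how long the sender waits for any acknowledgement
// before assuming loss: 3*RTT.
func (e *rttEstimator) senderTimeout() time.Duration { return 3 * e.avg }

func main() {
    e := rttEstimator{weight: 0.125}
    e.sample(100 * time.Millisecond)
    e.sample(180 * time.Millisecond)
    fmt.Println(e.avg, e.nackDelay(), e.senderTimeout())
}
\end{minted}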
To achieve the same curve as New Reno, there are two phases: exponential growth and congestion avoidance. On flow start, using a technique known as slow start, for every packet that is acknowledged, the window size is increased by one. When a packet loss is detected (using either of the two aforementioned methods), slow start ends, and the window size is halved. Now in congestion avoidance, the window size is increased by one for every full window of packets acknowledged without loss, instead of each individual packet. When a packet loss is detected, the window size is halved, and congestion avoidance continues.
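The two phases can be sketched as a small state machine over the window size. The type, its method names, the initial window of one packet and the floor of one packet after halving are assumptions of this example.

\begin{minted}{go}
package main

import "fmt"

// window tracks the congestion window, measured in packets.
type window struct {
    size      uint32
    acked     uint32 // packets acknowledged since the window last grew
    slowStart bool
}

func newWindow() *window { return &window{size: 1, slowStart: true} }

// onAck handles one acknowledged packet.
func (w *window) onAck() {
    if w.slowStart {
        w.size++ // one per packet: exponential growth per round trip
        return
    }
    w.acked++
    if w.acked >= w.size {
        w.size++ // one per full window: linear growth
        w.acked = 0
    }
}

// onLoss handles a change in NACK or a sender timeout.
func (w *window) onLoss() {
    w.slowStart = false
    w.acked = 0
    w.size /= 2
    if w.size < 1 {
        w.size = 1
    }
}

func main() {
    w := newWindow()
    for i := 0; i < 7; i++ {
        w.onAck()
    }
    fmt.Println("after slow start:", w.size)
    w.onLoss()
    fmt.Println("after loss:", w.size)
    for i := 0; i < 4; i++ {
        w.onAck()
    }
    fmt.Println("after one full window:", w.size)
}
\end{minted}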
The software portion of this proxy is entirely symmetric, as can be seen in figure \ref{fig:dataflow-overview}. However, the system configuration diverges, as each side of the proxy serves a different role. Referring to figure \ref{fig:dataflow-overview}, it can be seen that the kernel routing differs between the two nodes. Throughout, these two sides have been referred to as the local and remote portals, with the local in the top left and the remote in the bottom right.
As the software portion of this application is implemented in user-space, it has no control over the routing of packets. Instead, a virtual interface is provided, and the kernel is instructed how to route relevant packets via this interface. In sections \ref{section:implementation-remote-portal-routing} and \ref{section:implementation-local-portal-routing}, the configuration for routing the packets for the remote portal and local portal respectively are explained. Finally, in section \ref{section:implementation-multi-interface-routing}, some potentially unexpected behaviour of using devices with multiple interfaces is explained, such that the reader can avoid some of these pitfalls. Throughout this section, examples will be given for both Linux and FreeBSD. Though these examples are provided, they are one of many methods of achieving the same results.
The common case for remote portals is a cloud VPS with one public network interface. As such, some configuration is required to both proxy bidirectionally via that interface, and also use it for communication with the local portal. Firstly, packet forwarding must be enabled for the device. On Linux this is achieved as follows:
These instruct the kernel in each case to forward packets. However, more instructions are necessary to ensure packets are routed correctly once forwarded. For the remote portal, this involves two things: routing the communication for the proxy to the software side, and routing items necessary to the local system to the relevant application. Both of these are achieved in the same way, involving adjustments to the local routing table on Linux, and using \verb'pf(4)' rules on FreeBSD.
These settings combined will provide the proxying effect via the TUN interface configured in software. It is also likely worth firewalling much more aggressively at the remote portal side, as dropping packets before saturating the low bandwidth connections between the local and remote portal improves resilience to denial of service attacks. This can be completed either with similar routing and firewall rules to those above, or externally with many cloud providers, and is left as an exercise.
Routing within the local portal expects $1+N$ interfaces: one connected to the client device expecting the public IP, and $N$ connected to the wider Internet for communication with the other node. Referring to figure \ref{fig:dataflow-overview}, it can be seen that no specific rules are required to achieve this routing. Although this is true in most cases, the overview diagram avoids the complexity of the kernel routing to this software itself, which will be discussed in more detail here. Therefore, there are three goals: ensuring that packets for the remote IP are routed from the TUN to the client device and vice versa, ensuring that packets destined for the remote portal are not routed to the client, and ensuring that each connection is routed via the correct WAN connection. The first two will be covered in this section, with a discussion on the last in the next section.
Routing the packets from/for the local portal is pleasantly easy. Firstly, enable IP forwarding for Linux or gateway mode for FreeBSD, as seen previously. Secondly, routes must be set up. Fortunately, these routes are far simpler than those for the remote portal. The routing for the local portal client interface is as follows on Linux:
Then, on the client device, simply set the IP address statically to the remote portal address, and the gateway to \verb'192.168.1.1'. Now the local portal can send and receive packets to the remote portal, but some further routing rules are needed to ensure that the packets from the proxy reach the remote portal, and that forwarding works correctly. This falls to routing tables and \verb'pf(4)', so for Linux:
These rules achieve both the listed criteria, of communicating with the remote portal while also forwarding the packets necessary to the client. The local portal can be extended with more functionality, such as NAT and DHCP. This allows plug and play for the client, while also allowing multiple clients to take advantage of the connection without another router present.
During testing, I discovered behaviour that I found surprising when it came to multi-homed hosts. Here I will detail some of this behaviour, and workarounds found to enable the software to still work well regardless.
The first piece of surprising behaviour comes from a device which has multiple interfaces lying on the same subnet. Consider a device with two Ethernet interfaces, each of which gains a DHCP IPv4 address from the same network. The first interface \verb'eth0' takes the IP \verb'10.10.0.2' and the second \verb'eth1' takes the IP \verb'10.10.0.3', each with a subnet mask of \verb'/24'. If a packet originates from userspace with source address \verb'10.10.0.2' and destination address \verb'10.10.0.1', it may leave via either \verb'eth0' or \verb'eth1'. I initially found this behaviour very surprising, as it seems clear that the packet should be delivered from \verb'eth0', as that is the interface which has the given IP. However, as the routing is completed by the source subnet, each of these interfaces matches.
Although this may seem like a contrived use case, consider this: a dual WAN router lies in front of a server, which uses these two interfaces to take two IPs. Policy routing is used on the dual WAN router to allow this device control over choice of WAN, by using either of its LAN IPs. In this case, this default routing would mean that the userspace software has no control over the WAN, as one will be selected seemingly arbitrarily. The solution to this problem is manipulation of routing tables. By creating a high priority routing table for each interface, and routing packets more specifically than the default routes, the correct packets can be routed outbound via the correct interface.
The second issue follows a similar theme of IP addresses being owned by the host and not the interface which has that IP set, as Linux hosts respond to ARP requests for any of their IP addresses on all interfaces by default. This problem is known as ARP flux. Going back to our prior example of \verb'eth0' and \verb'eth1' on the same subnet, ARP flux means that if another host sends packets to \verb'10.10.0.2', they may arrive at either \verb'eth0' or \verb'eth1', and this changes with time. Once again, this is rather contrived, but also means that, for example, a private VPN IP will be responded to from the LAN a computer is on. Although this is desirable in some cases, it continues to seem like surprising default behaviour. The solution to this is also simple: a pair of kernel parameters, set as follows, resolves the issue.
\begin{minted}{shell-session}
sysctl -w net.ipv4.conf.all.arp_announce=1
sysctl -w net.ipv4.conf.all.arp_ignore=1
\end{minted}
The final discovery I made is that many of these problems can be solved by changing the question. In my real world testing, explained in section \ref{section:real-world-testing}, the local portal lies behind a dual WAN router. This router allows the same port to be accessible via two WAN IPs, and avoids any routing complication as the router itself handles the NAT perfectly. Prior to this I was attempting to route outbound, similar to the situation described above, with some difficulty. Hence it is worth considering whether an architecture modification can make the routing simpler for the task you are trying to achieve.
The program overall is structured for future growth, and to provide flexibility for network architects to implement it as they see fit. This chapter makes clear how this is achieved using interfaces that are flexible, before providing details on the concrete implementations. TCP provides a proof of concept with less implementation effort, but with varying performance outside of ideal environments, allowing the structure to be thoroughly tested before a more complex UDP implementation was developed. Security allows for either external or internal solutions to be used, with future support built for more complex internal initial exchanges, allowing for security measures such as digital signatures. Overall, this chapter shows a highly flexible solution for a multi-path proxy. In the next chapter, it will be shown to be highly performant.