From 2bde53b3da6586ac25ba937b8f52a081377ef11b Mon Sep 17 00:00:00 2001 From: md403 Date: Fri, 14 May 2021 07:35:03 +0000 Subject: [PATCH] Update on Overleaf. --- 0_Proforma/proforma.tex | 6 ++++-- 1_Introduction/introduction.tex | 8 ++++---- 2_Preparation/preparation.tex | 24 ++++++++++++------------ 4_Evaluation/evaluation.tex | 4 ++-- 5_Conclusions/conclusions.tex | 1 + Preamble/preamble.tex | 1 - 6 files changed, 23 insertions(+), 21 deletions(-) diff --git a/0_Proforma/proforma.tex b/0_Proforma/proforma.tex index 27e43ef..511f8c4 100644 --- a/0_Proforma/proforma.tex +++ b/0_Proforma/proforma.tex @@ -6,12 +6,14 @@ Candidate Number: & 2373A \\ Project Title: & A Multi-Path Bidirectional Layer 3 Proxy \\ Examination: & Computer Science Tripos - Part II, 2021 \\ - Word Count: & 13716 \\ - Line Count: & 2705 \\ + Word Count: & 12057 \\ + Line Count: & 3564\footnotemark \\ Project Originator: & The dissertation author \\ Supervisor: & Michael Dodson \end{tabular} +\footnotetext[1]{Gathered using \texttt{cat **/*.go | wc -l}} + \vspace{6mm} \section*{Original Aims of the Project} diff --git a/1_Introduction/introduction.tex b/1_Introduction/introduction.tex index e40d647..70d2dbb 100644 --- a/1_Introduction/introduction.tex +++ b/1_Introduction/introduction.tex @@ -19,15 +19,15 @@ Using a proxy to combine connections provides three significant benefits: immedi \section{Existing Work} -Three pieces of existing work that will be examined for their usefulness are MultiPath TCP (MPTCP), Wireguard, and Cloudflare. Multipath TCP is an effort to expand TCP (Transmission Control Protocol) connections to multiple paths, and is implemented at the kernel layer such that applications which already use TCP can immediately take advantange of the multipath benefits. Wireguard is a state of the art Virtual Private Network (VPN), providing an excellent example for transmitting packets securely over the Internet. 
Finally, Cloudflare shows examples of how a high bandwidth network can be used to supplement multiple smaller networks, but in a different context to this project. This section focuses on how these examples do not satisfy the aims of this project, and how they provide useful initial steps and considerations for this project. +Three pieces of existing work that will be examined for their usefulness are MultiPath TCP (MPTCP), Wireguard, and Cloudflare. Multipath TCP is an effort to expand TCP (Transmission Control Protocol) connections to multiple paths, and is implemented at the kernel layer such that applications which already use TCP can immediately take advantage of the multipath benefits. Wireguard is a state-of-the-art Virtual Private Network (VPN), providing an excellent example for transmitting packets securely over the Internet. Finally, Cloudflare shows examples of how a high bandwidth network can be used to supplement multiple smaller networks, but in a different context to this project. This section focuses on how these examples do not satisfy the aims of this project, and how they provide useful initial steps and considerations for this project. \subsection{MultiPath TCP (MPTCP)} -MultiPath TCP \citep{handley_tcp_2020} is an extension to the regular Transmission Control Protocol, allowing for the creation of subflows. MultiPath TCP was designed with two purposes: increasing resiliency and throughput for multi-homed mobile devices, and providing multi-homed servers with better control over balancing flows between their interfaces. Initially, MultiPath TCP seems like a solution to the aims of this project. However, it suffers for three reasons: the rise of User Datagram Protocol (UDP) -based protocols, device knowledge of interfaces, and legacy devices. +MPTCP \citep{handley_tcp_2020} is an extension to the regular Transmission Control Protocol, allowing for the creation of subflows.
MPTCP was designed with two purposes: increasing resiliency and throughput for multi-homed mobile devices, and providing multi-homed servers with better control over balancing flows between their interfaces. Initially, MPTCP seems like a solution to the aims of this project. However, it suffers for three reasons: the rise of User Datagram Protocol (UDP)-based protocols, device knowledge of interfaces, and legacy devices. -Although many UDP-based protocols have been around for a long time, using UDP-based protocols in applications to replace TCP-based protocols is a newer effort. An example of an older UDP-based protocol is SIP \citep{schooler_sip_2002}, still widely used for VoIP calls, which would benefit particularly from increased resilience to single Internet connection outages. For a more recent UDP-based protocol intended to replace a TCP-based protocol, HTTP/3 \citep{bishop_hypertext_2021}, also known as HTTP-over-QUIC, is one of the largest. HTTP/3 is enabled by default in Google Chrome \citep{govindan_enabling_2020} and its derivatives, soon to be enabled by default in Mozilla Firefox \citep{damjanovic_quic_2021}, and available behind an experimental flag in Apple's Safari \citep{kinnear_boost_2020}. Previously, HTTP requests have been sent over TCP connections, but HTTP/3 switches this to a UDP-based protocol. As such, HTTP requests are moving away from benefiting from MPTCP. +Although many UDP-based protocols have been around for a long time, using UDP-based protocols in applications to replace TCP-based protocols is a newer effort. An example of an older UDP-based protocol is SIP \citep{schooler_sip_2002}, still widely used for VoIP, which would benefit particularly from increased resilience to single Internet connection outages. For a more recent UDP-based protocol intended to replace a TCP-based protocol, HTTP/3 \citep{bishop_hypertext_2021}, also known as HTTP-over-QUIC, is one of the largest.
HTTP/3 is enabled by default in Google Chrome \citep{govindan_enabling_2020} and its derivatives, soon to be enabled by default in Mozilla Firefox \citep{damjanovic_quic_2021}, and available behind an experimental flag in Apple's Safari \citep{kinnear_boost_2020}. Previously, HTTP requests have been sent over TCP connections, but HTTP/3 switches this to a UDP-based protocol, reducing the benefit of MPTCP. -Secondly, devices using MPTCP must have knowledge of their network infrastructure. Consider the example of a phone with a WiFi and 4G interface reaching out to a voice assistant. The phone in this case can utilise MPTCP effectively, as it has knowledge of both Internet connections, and it can create subflows appropriately. However, consider instead a tablet with only a WiFi interface, but behind a router with two Wide Area Network (WAN) interfaces that is using Network Address Translation (NAT). In this case, the tablet will believe that it only has one connection to the Internet, while actually being able to take advantage of two. This is a problem that is difficult to solve at the client level, suggesting that solving the problem of combining multiple Internet connections is better suited to network infrastructure. +Secondly, devices using MPTCP must have knowledge of their network infrastructure. Consider the example of a phone with a WiFi and 4G interface reaching out to a voice assistant. The phone in this case can utilise MPTCP, as it has knowledge of both Internet connections. However, consider instead a tablet with only a WiFi interface, but behind a router with two Wide Area Network (WAN) interfaces using Network Address Translation (NAT). In this case, the tablet only sees one connection to the Internet, but could take advantage of two. This problem is difficult to solve at the client level, suggesting that solving the problem of combining multiple Internet connections is better suited to network infrastructure. 
Finally, it is important to remember legacy devices. Often, these legacy devices will benefit the most from resilience improvements, and they are the least likely to receive updates to new networking technologies such as MPTCP. Although MPTCP can still provide a significant balancing benefit to the servers to which legacy devices connect, the legacy devices see little benefit from the availability of multiple connections. In contrast, providing an infrastructure-level solution, such as the proxy presented here, benefits all devices behind it equally, regardless of their legacy status. diff --git a/2_Preparation/preparation.tex b/2_Preparation/preparation.tex index 8587f1c..d606f18 100644 --- a/2_Preparation/preparation.tex +++ b/2_Preparation/preparation.tex @@ -103,16 +103,16 @@ The benefits of using a VPN tunnel between the two proxies are shown in Figure \ \section{Language Selection} \label{section:language-selection} -In this section, I evaluate three potential languages (C++, Rust and Go) for the implementation of this software. To support this evaluation, I have provided a sample program in each language. The sample program is intended to be a minimal example of reading packets from a TUN interface, placing them in a queue from a single thread, and consuming the packets from the queue with multiple threads. These examples are given in figures \ref{fig:cpp-tun-sample} through \ref{fig:go-tun-sample}, in Appendix \ref{appendix:language-samples}. The first test was whether the small example was possible, which passed for all three languages. I then considered the performance of the language, clarity of code of the style needed to complete this software, and the ecosystem of the language. This culminated in choosing Go for the implementation language. +In this section, I evaluate three potential languages (C++, Rust and Go) for the implementation of this software. To support this evaluation, I have provided a sample program in each language. 
The sample program is a minimal example of reading packets from a TUN interface, placing them in a queue from a single thread, and consuming the packets from the queue with multiple threads. These examples are given in Figures \ref{fig:cpp-tun-sample} through \ref{fig:go-tun-sample}, in Appendix \ref{appendix:language-samples}. For each language, I considered performance, code clarity, and ecosystem. This culminated in choosing Go for the implementation language. -Alongside the implementation language, a language is chosen to evaluate the implementation. Two potential languages are considered here, Python and Java. Though Python was initially chosen for rapid development and better ecosystem support, the final test suite is a combination of both Python and Java - Python for data processing, and Java for systems interaction. +I similarly evaluated two languages for the test suite: Python and Java. Though Python was initially chosen for rapid development and better ecosystem support, the final test suite is a combination of both Python and Java: Python for data processing, and Java for systems interaction. \subsection{Implementation Languages} \subsubsection{C++} -There are two primary advantages to completing this project in C++: speed of execution, and C++ being low level enough to achieve this project's goals (which turned out to be true for all considered languages). The negatives of using C++ are demonstrated in the sample script, given in Figure \ref{fig:cpp-tun-sample}, where it is immediately obvious that to achieve even the base functionality of this project, the code in C++ is multiple times the length of equivalent code in either Rust or Go, at 93 lines compared to 34 for Rust or 48 for Go. This difference arises from the need to manually implement the required thread safe queue, while it is available as a library for Rust, and included in the Go runtime.
This manual implementation gives rise to additional risk of incorrect implementation, specifically with regards to thread safety, that could cause undefined behaviour, security vulnerabilities, and great difficulty debugging. Further, although open source queues are available, they are not handled by a package manager, and thus security updates would have to be manual, leaving opportunity for unfound bugs. +There are two primary advantages to completing this project in C++: speed of execution, and C++ being low level enough to achieve this project's goals (which turned out to be true for all considered languages). -The lack of memory safety in C++ is a significant negative of the language. Although C++ would provide increased performance over a language such as Go with a more feature-rich runtime, it is avoided due to the incidental complexity of manual memory management and the difficulty of manual thread safety. +The negatives of using C++ are demonstrated in the sample script, given in Figure \ref{fig:cpp-tun-sample}: achieving even the base functionality of this project requires multiple times more code than Rust or Go (93 lines compared to 34 for Rust or 48 for Go). This arises from the need to manually implement the required thread-safe queue, which is available as a library for Rust, and included in the Go runtime. This manual implementation gives rise to additional risk of incorrect implementation, specifically with regards to thread safety, that could cause undefined behaviour, security vulnerabilities, and great difficulty debugging. Further, although open source queues are available, they are not handled by a package manager, and thus security updates would have to be manual, risking the introduction of bugs. Finally, C++ does not provide any memory safety guarantees.
\subsubsection{Rust} @@ -151,7 +151,7 @@ The requirements of the project are detailed in the Success Criteria of the Proj The three categories of success criteria can be summarised as follows. The success criteria, or must have elements, are to provide a multi-path proxy that is functional, secure and improves speed and resilience in specific cases. The extended goals, or should have elements, are focused on increasing the performance and flexibility of the solution. The stretch goals, or could have elements, are aimed at increasing performance by reducing overheads, and supporting IPv6 alongside IPv4. -Beyond the success criteria, a requirement of the software produced is platform compatibility. As the proxy is expected to run on networking hardware, platforms such as Windows and MacOS will not be supported. However, networking hardware runs a wide variety of operating systems. The testing process will run on Linux and FreeBSD, but the software should be designed in such a way that more operating systems could be supported with minimal difficulty. +Beyond the success criteria, I wanted to demonstrate the practicality of my software on prototypical networking equipment; therefore, continuous integration testing and evaluation will run on Linux and FreeBSD. % ------------------------- Engineering Approach --------------------------- % \section{Engineering Approach} \subsubsection{Software Development Model} -The development of this software followed the agile methodology. Work was organised into weekly sprints, aiming for increased functionality in the software each time. By focusing on sufficient but not excessive planning, a minimum viable product was quickly established. From there, the remaining features could be implemented in the correct sized segments.
Examples of these sprints are: initial build including configuration, TUN adapters and main program; TCP transport, enabling an end-to-end connection between the two parts; repeatable testing, providing the data to evaluate each iteration of the project against its success criteria; UDP transport for performance and control. +The development of this software followed the agile methodology. Work was organised into weekly sprints, aiming for increased functionality in the software each time. By focusing on sufficient but not excessive planning, a minimum viable product was quickly established. From there, the remaining features could be implemented in the correct sized segments. Examples of these sprints are: initial build including configuration, TUN adaptors and main program; TCP transport, enabling an end-to-end connection between the two parts; repeatable testing, providing the data to evaluate each iteration of the project against its success criteria; UDP transport for performance and control. -One of the most important features of any agile methodology is welcoming changing requirements \citep{beck_manifesto_2001}. As the project grew, it became clear where shortcomings existed, and these could be fixed in very quick pull requests. An example is given in Figure \ref{fig:changing-requirements}, in which the type of a variable was changed from \mintinline{go}{string} to \mintinline{go}{func() string}. This allowed for lazy evaluation, when it became clear that configuring fixed IP addresses or DNS names could be impractical with certain setups. The static typing in the chosen language enables refactors like this to be completed with ease, particularly with the development tools mentioned in the next section, reducing the incidental complexity of the agile methodology. 
+The agile methodology welcomes changing requirements \citep{beck_manifesto_2001}; as the project grew, it became clear where shortcomings existed, and these could be fixed in very quick pull requests. An example is given in Figure \ref{fig:changing-requirements}, in which the type of a variable was changed from \mintinline{go}{string} to \mintinline{go}{func() string}. This allowed for lazy evaluation, when it became clear that configuring fixed IP addresses or DNS names could be impractical. Static typing enables refactors like this to be completed with ease, particularly with the development tools mentioned in the next section, reducing the incidental complexity of the agile methodology. \begin{figure} \centering @@ -181,17 +181,17 @@ One of the most important features of any agile methodology is welcoming changin \subsubsection{Development Tools} -A large part of the language choice focused on development tools. As discussed in Section \ref{section:language-selection}, IDE support is important for programming productivity. My preferred IDEs are those supplied by JetBrains,\footnote{\url{https://jetbrains.com/}} generously provided for education and academic research free of charge. As such, I used GoLand for the Go development of this project, IntelliJ for the Java evaluation development, and PyCharm for the Python evaluation program. Using an intelligent IDE, particularly with the statically typed Go and Java, can significantly increases programming productivity. They provide intelligent code suggestions and automated code generation for repetitive sections to reduce keystrokes, syntax highlighting for ease of reading, near-instant type checking without interaction, and many other features. Each reduce incidental complexity. +A large part of the language choice focused on development tools, particularly IDE support. I used GoLand (Go), IntelliJ (Java), and PyCharm (Python). 
Using intelligent IDEs, particularly with the statically-typed Go and Java, significantly increases programming productivity. They provide code suggestions and automated code generation for repetitive sections to reduce keystrokes, syntax highlighting for ease of reading, near-instant type checking without interaction, and many other features. Each reduces incidental complexity. -I used Git version control, with a self-hosted Gitea\footnote{\url{https://gitea.com/}} server as the remote. The repository contains over 180 commits, committed at regular intervals while programming. My repositories have a multitude of on- and off-site backups at varying frequencies (Multiple Computers + Git Remote + NAS + 2xCloud + 2xUSB). The Git remote was updated with every commit, the NAS and Cloud providers daily, with one USB updated every time significant work was added and the other a few days after. Having some automated and some manual backups, along with a wide variety of backup locations, ensures that the potential data loss in the event of any failure is minimal. The backups are regularly checked for consistency, to ensure no data loss goes unnoticed. +I used Git version control, with a self-hosted Gitea\footnote{\url{https://gitea.com/}} server as the remote. The repository contains over 180 commits, committed at regular intervals while programming. I maintained several on- and off-site backups (Multiple Computers + Git Remote + NAS + 2xCloud + 2xUSB). The Git remote was updated with every commit, the NAS and Cloud providers daily, with one USB updated every time significant work was added and the other a few days after. Having some automated and some manual backups, along with a variety of backup locations, minimises potential data loss in the event of a failure. The backups are regularly checked for consistency, to ensure no data loss goes unnoticed. 
-Alongside my self-hosted Gitea server, I have a self-hosted Drone\footnote{\url{http://drone.io/}} server for continuous integration. This made it simple to add a Drone file to the repository, allowing for the Go tests to be run, formatting verified, and artefacts built. On a push, after the verification, each artefact is built and uploaded to a central repository, where it is saved under the branch name. This is particularly useful for automated testing, as the relevant artefact can be downloaded automatically from a known location for the branch under test. Further, artefacts are built for multiple architectures, particularly useful when performing real world testing spread between \texttt{AMD64} and \texttt{ARM64} architectures. +Alongside my Gitea server, I have a self-hosted Drone\footnote{\url{http://drone.io/}} server for continuous integration: running Go tests, verifying formatting, and building artefacts. On a push, after verification, each artefact is built, uploaded to a central repository, and saved under the branch name. This dovetailed with my automated testing, which downloaded the relevant artefact automatically for the branch under test. I also built artefacts for multiple architectures to support real world testing on \texttt{AMD64} and \texttt{ARM64} architectures. -Continuous integration and Git are used in tandem to ensure that all code in a pull request meets certain standards. By ensuring that tests are automatically run before merging, all code that is merged must be formatted correctly and able to pass the tests. This removes the possibility of accidentally causing an already tested for regression to occur during a merge by forgetting to run the tests. Pull requests also provide an opportunity to review submitted code, even with the same set of eyes, in an attempt to detect any glaring errors. Twenty-four pull requests were submitted to the repository for this project. 
+Continuous integration and Git are used in tandem to ensure that each pull request meets certain standards before merging, reducing the possibility of accidentally introducing regressions. Pull requests also provide an opportunity to review submitted code, even with the same set of eyes, in an attempt to detect any glaring errors. Twenty-four pull requests were submitted to the repository for this project. \subsubsection{Licensing} -I have chosen to license this software under the MIT license. The MIT license is simple and permissive, enabling reuse and modification of the code, subject to including the license. Alongside the hopes that the code will receive updated pull requests over time, a permissive license allows others to build upon the given solution. A potential example of a solution that could build from this is a company employing a Software as a Service (SaaS) model to configure a remote proxy on your behalf, perhaps including the hardware required to convert this fairly involved solution into a plug-and-play option. +I chose to license this software under the MIT license, which is simple and permissive. % ---------------------------- Starting Point ------------------------------ % \section{Starting Point} diff --git a/4_Evaluation/evaluation.tex b/4_Evaluation/evaluation.tex index 3eab500..50bf8b0 100644 --- a/4_Evaluation/evaluation.tex +++ b/4_Evaluation/evaluation.tex @@ -109,7 +109,7 @@ For showing improved throughput over connections which are not equal, three resu \centering \begin{subfigure}{.49\textwidth} \includegraphics[width=0.9\linewidth]{graphs/more-bandwidth-unequal-a-inbound} - \caption{Throughput of proxied connections inbound to the client.} + \caption{Bandwidth of unequal connections compared to the } \label{fig:more-bandwidth-unequal-lesser} \end{subfigure} \begin{subfigure}{.49\textwidth} @@ -173,7 +173,7 @@ The extended goal of connection metric values has not been implemented.
Instead, \subsection{UDP Proxy Flows} -Although UDP proxy flows are implemented, they are unable to provide improved performance over a TCP connection. +UDP flows are implemented, and provide a solid base for UDP testing and development. The present congestion control mechanism, which imitates New Reno, still has some implementation flaws, meaning that UDP is not yet feasible for use. However, the API for writing congestion control mechanisms is strong, and some of the features suggested in Section \ref{section:future-work} could be developed on this base, which is a success in itself. \section{Performance Evaluation} \label{section:performance-evaluation} diff --git a/5_Conclusions/conclusions.tex b/5_Conclusions/conclusions.tex index ab4e372..357fd0d 100644 --- a/5_Conclusions/conclusions.tex +++ b/5_Conclusions/conclusions.tex @@ -28,6 +28,7 @@ On re-implementation of this work, more considerations should be made for the in Many of the lessons learnt relating to IP routing are detailed in Section \ref{section:implementation-system-configuration}, which would aid future implementations significantly, allowing the developer to focus only on what needs to occur in the application itself. Similarly, Figure \ref{fig:dataflow-overview} provides a comprehensive overview of the particularly complex dataflow within this application. These tools provide an effective summary of the information needed to implement this software again, reducing the complexity of such a new implementation, and allowing the developer to focus on the important features. \section{Future Work} +\label{section:future-work} Alternative methods of load balancing could take multipath proxies further. Having control of both proxies allows for a variety of load balancing mechanisms, of which congestion control is only one. An alternative method is to monitor packet loss, and use this to infer the maximum capacity of each link. 
These capacities can then be used to load balance packets by proportion as opposed to greedily with congestion control. This could provide performance benefits over congestion control by allowing the congestion control mechanisms of underlying flows to be better employed, while also having trade-offs with slower reaction to connection changes. diff --git a/Preamble/preamble.tex b/Preamble/preamble.tex index 33323a3..bc33585 100644 --- a/Preamble/preamble.tex +++ b/Preamble/preamble.tex @@ -100,7 +100,6 @@ %\usepackage{longtable} \usepackage{tabularx} - % *********************************** SI Units ********************************* \usepackage{siunitx} % use this package module for SI units