\subsection{Shared subtrees}
\label{sec:voiding-mount-shared-subtrees}
\begin{listing}
\begin{minipage}{.49\textwidth}
\label{lst:shared-subtrees}
\end{listing}

Shared subtrees \citep{pai_shared_2005} were introduced to provide a consistent view of the unified hierarchy between namespaces. Consider the example in Listing \ref{lst:shared-subtrees}. \texttt{unshare(1)} creates a non-shared tree, which presents the behaviour shown. Although \texttt{/mnt/cdrom} from the parent namespace has been bind mounted in the new namespace, the content of \texttt{/mnt/cdrom} is not the same. This is because the filesystem newly mounted on \texttt{/mnt/cdrom} is unavailable in the separate mount namespace. To combat this, shared subtrees were introduced. That is, as long as \texttt{/mnt/cdrom} resides on a shared subtree, the newly mounted filesystem will be available to a bind of \texttt{/mnt/cdrom} in another namespace. \texttt{systemd} made the choice to mount \texttt{/} as a shared subtree \citep{free_software_foundation_mount_namespaces7_2021}:

\say{Notwithstanding the fact that the default propagation type for new mount is in many cases \texttt{MS\_PRIVATE}, \texttt{MS\_SHARED} is typically more useful. For this reason, \texttt{systemd(1)} automatically remounts all mounts as \texttt{MS\_SHARED} on system startup. Thus, on most modern systems, the default propagation type is in practice \texttt{MS\_SHARED}.}

This means that when creating a new namespace, mounts and unmounts are propagated by default. More specifically, it means that mounts and unmounts are propagated both from the parent namespace to the child, and from the child namespace to the parent. That is, if a mount is unmounted in the new namespace, it is also unmounted in the parent. This can be highly confusing behaviour, as it provides minimal isolation by default. \texttt{unshare(1)} considers this behaviour inconsistent with the goals of unsharing - it immediately calls \texttt{mount("none", "/", NULL, MS\_REC|MS\_PRIVATE, NULL)} after \texttt{unshare(CLONE\_NEWNS)}, detaching the newly unshared tree. The reasoning for enabling \texttt{MS\_SHARED} by default is that containers created should not present the behaviour given in Listing \ref{lst:shared-subtrees}, and this behaviour is unavoidable unless the parent mounts are shared, while it is possible to disable the behaviour where necessary.
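A minimal C sketch of that sequence, mirroring what \texttt{unshare(1)} does (error handling is illustrative only, and sufficient privilege - for example a prior user namespace - is assumed):

\begin{minted}{c}
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Create a new mount namespace; on most systems the new tree starts shared. */
    if (unshare(CLONE_NEWNS) == -1) {
        perror("unshare");
        return 1;
    }
    /* Recursively mark every mount private so that mount and unmount events
     * no longer propagate to or from the parent namespace. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount");
        return 1;
    }
    return 0;
}
\end{minted}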

While some other namespaces are inherited, for example UTS namespaces, they do not present the same challenge as mount namespaces, as it is trivial to create the conditions for a void process by setting the hostname of the machine to a constant. This removes any relation to the parent namespace and to the outside machine. Mount namespaces instead maintain a shared pointer with most filesystems, more akin to not creating a new namespace than an inherited namespace.

\subsection{Lazy unmounting}
\label{sec:voiding-mount-lazy-unmount}
The final interesting behaviour comes with unmounting the old root filesystem. Although this may initially seem isolated to void processes, it is also a problem in a container system. Consider again the container created in Listing \ref{lst:shared-subtrees}: the existing root must be unmounted after pivoting, else the container remains fully connected to the outside root.

Referring again to network namespaces, sockets continue to exist in their initial namespace, allowing for regular file-descriptor passing semantics \citep{biederman_re_2007}. Extending upon this socket behaviour is Wireguard, which creates adapters that may be freely moved between namespaces while continuing to connect externally from their initial parent \citep[§7.3]{donenfeld_wireguard_2017}.

Although file descriptors work in this way with mount namespaces, the memory mapping of a currently running process's binary does not. Consider the example in Listing \ref{lst:unshare-umount}, which shows a short C program and the result of running it. It is seen that the \texttt{/} mount is busy when attempting the unmount. Given that the process was created in the parent namespace, the behaviour of file descriptors would suggest that the process would maintain a link to the parent namespace for its own memory mapped regions. However, the fact that the otherwise empty namespace has a busy mount demonstrates that this is not the case.
\begin{listing}
\begin{minted}{c}
\label{lst:unshare-umount-lazy}
\end{listing}

This behaviour raises questions about why a shared subtree, which exists as a reference-counted object, would need to be detached recursively - decreasing the reference count to the shared subtree itself would seem sufficient. The inconsistency is best explained by looking at the development timeline for the three features here: mount namespaces, shared subtrees, and recursive lazy unmounts. When lazy unmounting was added, in September 2001, the author said the following \citep{viro_patch_2001}:

\say{There are only two things to take care of -
a) if we detach a parent we should do it for all children
Both are obviously staisfied (sic) for current code (presence of children
means that vfsmount is busy and we can't mount on something that
doesn't exist).}

This logic held even in the presence of namespaces, with the initial patchset in February 2001 \citep{viro_patchcft_2001}, as mounts were not initially shared but duplicated between namespaces. However, when shared subtrees were added in January 2005 \citep{viro_rfc_2005}, this logic stopped holding.

When setting up a container environment, one calls \texttt{pivot\_root(2)} to replace the old root with a new root for the container. Only then may the old root be unmounted. Oftentimes the solution is to exec a binary in the new root first, meaning that the old root is no longer in use and may be unmounted. This works, as the old root is only a reference in this namespace, and hence may be unmounted with children - the \texttt{vfsmount} in this namespace is not busy, contradicting an assertion in the quotation.

If one wishes to continue running the existing binary, this is possible with lazy unmounting. However, the kernel only exposes a recursive lazy unmount to user-space. With shared subtrees, this results in destroying the parent tree. While this is avoidable by removing the shared propagation from the subtree before unmounting, the choice to have \texttt{MNT\_DETACH} aggressively cross shared subtrees can be highly confusing, and perhaps undesired behaviour in a world with shared subtrees by default.

Void processes mount an empty \texttt{tmpfs} file system in a new namespace, which doesn't propagate to the parent, and use the \texttt{pivot\_root(8)} command to make this the new root. By pivoting to the \texttt{tmpfs}, the old root exists as the only reference in the otherwise empty \texttt{tmpfs}. Finally, after ensuring the old root is set to \texttt{MNT\_PRIVATE} to avoid propagation, the old root can be lazily detached. This allows the binary from the parent namespace to continue running correctly. Any new processes only have access to the materials in the empty \texttt{tmpfs}. This new \texttt{tmpfs} never appears in the parent namespace, separating the void process effectively from the parent namespace.
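A sketch of this voiding sequence is given below; the paths are illustrative, error handling is abbreviated, and sufficient privilege (for example a previously created user namespace) is assumed:

\begin{minted}{c}
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

static void must(long ret, const char *what) {
    if (ret == -1) { perror(what); exit(1); }
}

int main(void) {
    must(unshare(CLONE_NEWNS), "unshare");
    /* Stop any propagation back to the parent namespace, covering the old root. */
    must(mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL), "private /");
    /* An empty tmpfs becomes the new root; /tmp/void is an illustrative path. */
    must(mount("tmpfs", "/tmp/void", "tmpfs", 0, NULL), "tmpfs");
    must(mkdir("/tmp/void/old", 0700), "mkdir");
    /* pivot_root(2) has no glibc wrapper, so call it directly. */
    must(syscall(SYS_pivot_root, "/tmp/void", "/tmp/void/old"), "pivot_root");
    must(chdir("/"), "chdir");
    /* Lazily detach the old root; the running binary keeps its mapping. */
    must(umount2("/old", MNT_DETACH), "umount2");
    return 0;
}
\end{minted}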
\section{User namespaces}
\label{sec:voiding-user}

User namespaces provide isolation of security credentials between processes. They isolate uids, gids, the root directory, keys and capabilities. Rather than the shim being a \texttt{setuid} or \texttt{CAP\_SYS\_ADMIN} binary, it can instead operate with ambient authority. This vastly simplifies the logic for opening file descriptors to pass to the child processes, as the shim itself is already operating with correctly limited authority.

Similarly to many other namespaces, user namespaces suffer from needing to limit their isolation. For a user namespace to be useful, some relation needs to exist between processes in the user namespace and objects outside. That is, if a process in a user namespace shares a filesystem with a process in the parent namespace, there should be a way to share credentials. To achieve this, user namespaces provide a mapping between users in the namespace and users outside. The most common use-case is to map root in the user namespace to the creating user outside, meaning that a process with full privileges in the namespace will be constrained to the creating user's ambient authority.

To create an effective void process, content must be written to the files \texttt{/proc/[pid]/uid\_map} and \texttt{/proc/[pid]/gid\_map}. In the case of the shim, uid 0 and gid 0 are mapped to the creating user. This is done first such that the remaining stages in creating a void process can have root capabilities within the user namespace - this is not possible prior to writing to these files. Otherwise, \texttt{CLONE\_NEWUSER} combines effectively with other namespace flags, ensuring that the user namespace is created first. This enables the other namespaces to be created without additional permissions.
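A sketch of how a shim might write these mappings for a child process is shown below; the helper names are hypothetical and error handling is omitted:

\begin{minted}{c}
#include <stdio.h>
#include <sys/types.h>

/* Write a single string to a procfs file (error handling elided). */
static void write_file(const char *path, const char *content) {
    FILE *f = fopen(path, "w");
    if (f) { fputs(content, f); fclose(f); }
}

/* Map uid 0 and gid 0 in the child user namespace to the creating user. */
static void map_root_to_caller(pid_t pid, uid_t uid, gid_t gid) {
    char path[64], buf[64];

    snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)pid);
    snprintf(buf, sizeof(buf), "0 %d 1\n", (int)uid);
    write_file(path, buf);

    /* An unprivileged writer must deny setgroups before writing gid_map. */
    snprintf(path, sizeof(path), "/proc/%d/setgroups", (int)pid);
    write_file(path, "deny\n");

    snprintf(path, sizeof(path), "/proc/%d/gid_map", (int)pid);
    snprintf(buf, sizeof(buf), "0 %d 1\n", (int)gid);
    write_file(path, buf);
}
\end{minted}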

User namespaces again interact with \texttt{procfs}, which brings up an interesting limitation to the capabilities available in user namespaces. On most systems, \texttt{procfs} has a variety of mounts over parts of it. This might be to interact with a hypervisor such as Xen, to support \texttt{binfmt\_misc} for running special applications, or Docker protecting the host from container mishaps. The series of mounts in a Docker container on one of my machines are shown in Listing \ref{lst:docker-procfs}. The objects mounted over include \texttt{/proc/kcore}, which presents direct access to all of the kernel's allocatable memory. Linux protects these mounts by enforcing that \texttt{procfs} with mounts below it can only be mounted in a new place if the user has root privilege in the init namespace. Fortunately, one can instead bind \texttt{/proc} to the parent namespace before remounting it, which is allowed with mounts below. Further, by running the void process with restricted authority (limited to that of the calling user even as root), the dangerous files in \texttt{/proc} are protected using discretionary access control. This avoids the requirement of adding extra mounts in the void orchestrator.
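One way to carry the parent's \texttt{/proc}, overmounts included, into a new namespace is a recursive bind mount rather than a fresh \texttt{procfs} mount; a hedged sketch follows, with \texttt{/newroot/proc} as a purely illustrative target path (this is one reading of the workaround, not necessarily the exact sequence used here):

\begin{minted}{c}
#include <stdio.h>
#include <sys/mount.h>

/* Recursively bind the existing /proc (keeping any mounts that mask parts
 * of it) instead of mounting a fresh procfs, which the kernel refuses for
 * an unprivileged user when the host /proc has overmounts. */
int bind_proc(const char *newroot_proc) {
    if (mount("/proc", newroot_proc, NULL, MS_BIND | MS_REC, NULL) == -1) {
        perror("bind /proc");
        return -1;
    }
    return 0;
}
\end{minted}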
\begin{listing}
\begin{minted}{shell-session}
\label{lst:docker-procfs}
\end{listing}

User namespaces act as both a blessing and a curse for security. In the case of Docker, with CVE-2021-21284, a remapped user may be able to alter the initial source of the mappings, causing them to be overridden and gaining root access. In contrast with containerd, with CVE-2021-23021, an always root containerd daemon mounts files that shouldn't be accessible with DAC due to a logic error - mapped user namespaces preserve DAC, preventing such errors.

\section{Control group namespaces}
\label{sec:voiding-cgroup}
\item Unshare the cgroup namespace.
\end{enumerate}

By following this sequence of calls, the process in the void would only see the leaf which contains itself and nothing else, limiting access to the host system. Running the shim with ambient authority here presents an issue as the cgroup hierarchy relies on discretionary access control. In order to move the process into a leaf the shim must have sufficient authority to modify the cgroup hierarchy. Due to this behaviour, and hence the unreliability of correctly voiding cgroup processes, the void orchestrator performs only the third step - unsharing the cgroup namespace. This makes cgroups the only namespace which can't be voided with ambient authority, suggesting a strong need for kernel changes.
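A sketch of a plausible reading of the three-step sequence is given below; the leaf path is hypothetical, and the first two steps are exactly those that need write access to the hierarchy:

\begin{minted}{c}
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int enter_cgroup_leaf(void) {
    const char *leaf = "/sys/fs/cgroup/void-leaf";   /* hypothetical leaf */

    /* Step 1: create a leaf in the unified hierarchy. */
    if (mkdir(leaf, 0755) == -1)
        perror("mkdir");

    /* Step 2: move this process into the leaf. */
    FILE *f = fopen("/sys/fs/cgroup/void-leaf/cgroup.procs", "w");
    if (f) { fprintf(f, "%d\n", getpid()); fclose(f); }

    /* Step 3: unshare the cgroup namespace so the leaf appears as the root. */
    return unshare(CLONE_NEWCGROUP);
}
\end{minted}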

There are two problems when working with cgroups namespaces in user-space: needing sufficient discretionary access control, and leaving the control of individual application processes in a global namespace. I posit an alternative cgroup namespace design: a process in a new cgroups namespace creates a detached hierarchy with the process as a leaf of the root and full permissions in the user namespace that created it. The main cgroups hierarchy sees a single application to control, while the application itself would have full control over sharing its externally limited resources. This would make the delegation of resources explicit in the design. That is, the main namespace could be handled by \texttt{systemd}, while the \texttt{/docker} namespace could be internally managed by docker. This would allow \texttt{systemd} to move the \texttt{/docker} namespace around as required, with isolation of the choices made internally.

\section{Creation cost}
\label{sec:void-creation-costs}
As shown in this chapter, creating a void requires creating 7 distinct namespaces for isolation. There are two options to create these namespaces: \texttt{clone(2)} or \texttt{unshare(2)}. As the void orchestrator uses \texttt{clone(2)}, we evaluate the performance of that call.

These tests were run on my development machine, using Linux 5.15.0-33-generic on Ubuntu 22.04 LTS. It is a Xen-based virtual machine, hence absolute results are less important than trends. The test process calls \texttt{clone(2)} with the requisite flags, then waits for the child process to exit. The child process exits immediately after returning from clone. The time is taken from before the \texttt{clone(2)} call and after the \texttt{wait(2)} call returns, using the high-precision \texttt{CLOCK\_MONOTONIC}. This timing code runs in a tight C for loop, which executes 1250 times. The first 250 entries of each run are discarded. Prior to running the variety of clone tests, 12500 clone calls are made in an attempt to warm up the system.
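A sketch of one timed iteration of such a harness (simplified, with error handling omitted):

\begin{minted}{c}
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

/* The child exits immediately after returning from clone. */
static int child(void *arg) { (void)arg; return 0; }

/* Time a single clone(2) with the namespace flags under test, e.g. CLONE_NEWNET. */
static long long time_clone_ns(int flags) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pid_t pid = clone(child, child_stack + STACK_SIZE, flags | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}
\end{minted}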

Figure \ref{fig:namespace-times} compares the time of \texttt{clone(2)} calls with a single namespace creation flag, and a \texttt{clone(2)} call that creates no namespaces. Ignoring the anomaly that a clone call which creates a namespace is cheaper than one which doesn't, there is a clear difference shown in the creation time of network namespaces compared to user namespaces. Further, we see that creating a network namespace is approximately four times slower than not creating any.
\begin{figure}
\centering
\label{fig:namespace-times}
\end{figure}

As void processes must create multiple namespaces to isolate processes effectively, the cost of creating several namespaces at once is of more interest than any single one. This is shown in Figure \ref{fig:namespace-stacked-times}. Here the gap between the three slowest namespaces in Figure \ref{fig:namespace-times} is exaggerated, showing a significant divide between the quick four namespaces and the slow final three.
\begin{figure}
\centering

\section{Summary}
In this chapter I discussed the 8 namespaces available in Linux 5.15: what each namespace protects against, how to completely empty each created namespace, and the constraints in doing so. For cgroup and mount namespaces, alternative designs that would increase their usability were also presented.

Emptying namespaces has been motivated by protection from vulnerabilities. Facilities to re-expose some of the system must now be introduced to build useful applications. The methods for reintroducing parts of the system are given in Chapter \ref{chap:filling-the-void}, before demonstrating how to build applications in Chapter \ref{chap:building-apps}.
\chapter{Filling the Void}
\label{chap:filling-the-void}

Now that a completely empty set of namespaces is available for a void process, the ability to reinsert specific privileges must be added to support non-trivial applications. This is achieved using a mixture of file-descriptor capabilities (§\ref{sec:priv-sep-ownership}) and adding elements to the empty namespaces (§\ref{sec:priv-sep-perspective}). Capabilities allow for very explicit privilege passing where suitable, while adding elements to namespaces supports more of Linux's modern features.
\section{Mount namespace}
\label{sec:filling-mount}

There are two options to provide access to files and directories in the void. For a single file, an opened file descriptor can be offered. Consider the TLS broker of a TLS server with a persistent certificate and keyfile. Only these files are required to correctly run the application - no view of a filesystem is necessary. Providing an already opened file descriptor gives the process a capability to those files while requiring no concept of a filesystem, allowing that to remain a complete void. This is possible because of the semantics of file descriptor passing across namespaces - the file descriptor remains a capability, regardless of moving into a namespace without the file.
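A sketch of the standard \texttt{SCM\_RIGHTS} mechanism that underlies this - sending an already-open descriptor over a Unix socket so that it arrives in the void as a capability - is:

\begin{minted}{c}
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send the open file descriptor fd (e.g. a certificate file) over a
 * connected Unix domain socket; it arrives as a capability regardless of
 * the receiver's mount namespace. */
int send_fd(int unix_sock, int fd) {
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}
\end{minted}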

Alternatively, files and directories can be mounted in the void process's namespace. This supports three things which the capabilities do not: directories, dynamic linking, and existing applications. Firstly, the existing \texttt{openat(2)} calls and the resultant directory file descriptors are not suitable as capabilities - they only retain the path of their directory internally. This means that a process with a directory file descriptor in another namespace cannot use it to access files outside of the namespace, removing all utility as capabilities in the void. Secondly, dynamic linking is best served by binding files, as these read-only copies and the trusted binaries ensure that only the required libraries can be linked against. Finally, support for individual required files can be added by using file descriptors, but many applications will not trivially support it. Binding files allows for some backwards compatibility with applications that are more difficult to adapt.
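Binding a path read-only takes two mount calls, as the initial bind ignores \texttt{MS\_RDONLY}; a minimal sketch, with both paths supplied by the caller, is:

\begin{minted}{c}
#include <sys/mount.h>

/* Bind src onto dst inside the void's mount namespace, then remount the
 * bind read-only; a second, remounting call is required for the flag to
 * take effect. */
int bind_read_only(const char *src, const char *dst) {
    if (mount(src, dst, NULL, MS_BIND, NULL) == -1)
        return -1;
    return mount(NULL, dst, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL);
}
\end{minted}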
\section{Network namespace}
\label{sec:filling-net}

Reintroducing networking to a void process follows a similar capability-based paradigm to reintroducing files. Rather than providing the full Linux networking subsystem to a void process, it is instead handed a file descriptor that already has the requisite networking permissions. A capability for a listening network socket can be requested statically in the application's specification. This socket remains open and allows the application to continuously accept requests, generating the appropriate socket for each request within the application itself. These request capabilities can be dealt with in the same void process or handed back to the shim to be distributed to another void process.
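From the application's side this can be as simple as accepting on the inherited descriptor; the sketch below assumes, purely for illustration, that the shim passes the listening socket as file descriptor 3:

\begin{minted}{c}
#include <sys/socket.h>
#include <unistd.h>

#define LISTEN_FD 3   /* assumed position of the passed listening socket */

void serve(void) {
    for (;;) {
        int conn = accept(LISTEN_FD, NULL, NULL);
        if (conn == -1)
            continue;
        /* Handle the request here, or hand conn back to the shim for
         * another void process to deal with. */
        close(conn);
    }
}
\end{minted}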

Outbound networking is more difficult to re-add to a void process than inbound networking. Containerisation solutions such as Docker use NAT with bridged adapters by default. That is, the container is provided an internal IP address that allows access to all networks via the host. Virtual machine solutions take a similar approach, creating bridged Ethernet adapters on the outside network or on a private NAT. Each of these approaches gives the container/machine the appearance of unbounded outbound access, relying on firewalls to limit this afterwards. This does not fit well with the ethos of creating a void process - minimum privilege by default. An ideal solution would provide precise network access to the void, rather than adding all access and restricting it after the fact. This is achieved with inbound sockets by providing the precise and already connected socket to an otherwise empty network namespace, which does not support creating exposed inbound sockets of its own.

Consideration was given to providing outbound access with statically created and passed sockets, the same as inbound access - for example, a socket to a database could be declared in the specification. The downside of this approach is that the socket lifecycle is still handled by the kernel. While this could work well with UDP sockets, TCP sockets can fail because the remote was closed or a break in the path caused a timeout to be hit.

Given that statically giving sockets is infeasible and adding a firewall does not fit well with creating a void, I sought an alternative API. \texttt{pledge(2)} is a system call from OpenBSD which restricts future system calls to an approved set \citep{the_openbsd_foundation_pledge2_2022}. This seems like a good fit, though operating outside of the operating system makes the implementation very different. Acceptable sockets are specified in the application specification, then an interaction socket is provided to request various pre-approved sockets from the shim. This allows limited access to the host network, approved or denied at request time instead of by a firewall. That is, access to a precisely configured socket can be injected to the void, with a capability to request such sockets and a capability given for each socket requested.
\section{User namespace}
\label{sec:filling-user}

Filling a user namespace is a slightly odd concept compared to the namespaces already discussed in this section. A user namespace comes with no implicit mapping of users whatsoever (§\ref{sec:voiding-user}). To enable applications to be run with bounded authority, a single mapping is added by the Void Orchestrator of \texttt{root} in the child user namespace to the launching UID in the parent namespace. This means that the user with highest privilege in the container, \texttt{root}, will be limited to the access of the launching user. The behaviour of mapping \texttt{root} to the calling user is shown with the \texttt{unshare(1)} command in Listing \ref{lst:mapped-root-directory}, where a directory owned by the calling user, \texttt{alice}, appears to be owned by \texttt{root} in the new namespace. A file owned by \texttt{root} in the parent namespace appears to be owned by \texttt{nobody} in the child namespace, as no mapping exists for that file's user.
\label{lst:mapped-root-directory}
\end{listing}

The way user namespaces are currently used creates a binary system: either a file appears as owned by \texttt{root} if owned by the calling user, or appears as owned by \texttt{nobody} if not. One questions whether more users could be mapped in, but this presents additional difficulties. Firstly, the \texttt{setgroups(2)} system call must be denied to achieve correct behaviour in the child namespace. This is because the \texttt{root} user in the child namespace has full capabilities, which include \texttt{CAP\_SETGID}. This means that the user in the namespace can drop their groups, potentially allowing access to materials which the creating user could not access (consider a file with permissions \texttt{0707}). This limits the utility of switching user in the child namespace, as the groups must remain the same. Secondly, mapping to users and groups other than oneself requires \texttt{CAP\_SETUID} or \texttt{CAP\_SETGID} in the parent namespace. Avoiding this is well advised to reduce the ambient authority of the shim.

Voiding the user namespace initially provides the ability to create other namespaces with ambient authority, and hides the details of the void process's ambient permissions from inside. Although this creates a binary system of users which may at first seem limiting, applying the context of void processes demonstrates that it is not. Linux itself may utilise users, groups and capabilities for process limits, but void processes only provide minimum privilege by default. That is, if a process should not have access to a file owned by the same user, it is simply not made available. Running only as \texttt{root} within the void process is therefore not a problem - multiple users is a feature of Linux which doesn't assist void processes in providing minimum privilege, so is absent.
\section{Remaining namespaces}

\subsection{UTS namespace}
\label{sec:filling-uts}

UTS namespaces are easily voided by setting the two controlled strings to a static string. However, if one wishes for them to hold specific values, they can be set in one of two ways: either calling \texttt{sethostname(2)} or \texttt{setdomainname(2)} from within the void process, or by providing static values within the void process's specification.
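For example, either string can be pinned from within the void process itself (a minimal sketch; the constant is illustrative):

\begin{minted}{c}
#define _DEFAULT_SOURCE
#include <string.h>
#include <unistd.h>

/* Pin both UTS strings to a constant, severing any relation to the host. */
void void_uts(void) {
    const char *name = "void";   /* illustrative constant */
    sethostname(name, strlen(name));
    setdomainname(name, strlen(name));
}
\end{minted}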
\subsection{IPC namespace}
\label{sec:filling-ipc}

Filling IPC namespaces is also not possible in this context, as IPC namespaces are created empty (§\ref{sec:voiding-ipc}). IPC objects exist in one and only one IPC namespace, due to sharing what they expect to be a global namespace of keys. This means that existing IPC objects cannot be mapped into the void process's namespace. However, the process within the IPC namespace can use IPC objects, for example between threads. This is potentially inadvisable, because different void processes would provide stronger isolation than IPC within a single void process. Alternative IPC methods are available which use the filesystem namespace and are better shared between void processes.
\subsection{PID namespace}
\label{sec:filling-pid}

A created PID namespace exists by itself, with no concept of mapping in PIDs from the parent namespace. The first process created in the namespace becomes PID 1, and after that other processes can be spawned from within. As such there is no need to fill PID namespaces; instead, applications can be restructured not to expect to see other processes' IDs.
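The behaviour is easy to observe: the first process cloned into a new PID namespace sees itself as PID 1 (a minimal sketch; the user namespace flag lets it run unprivileged):

\begin{minted}{c}
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int child(void *arg) {
    (void)arg;
    printf("pid inside the new namespace: %d\n", getpid());   /* prints 1 */
    return 0;
}

int main(void) {
    pid_t pid = clone(child, stack + sizeof(stack),
                      CLONE_NEWUSER | CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}
\end{minted}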
\subsection{Control group namespace}
\label{sec:filling-cgroup}

Control group namespaces present some very interesting behaviour in this regard. What appears to be the root in the new cgroup namespace is in fact a subtree of the hierarchy in the parent. This again provides a quite strange concept of filling - elements of the tree cannot be cloned to appear in two places, by design. To provide fuller interaction with the cgroups system, one can instead bind whichever subtree they wish to act on from the parent mount namespace to the child mount namespace. This provides control over any section of the cgroups subtree seen fit, and is unaffected by the cgroup namespace of the child. That is, the cgroup namespace is used only to provide a void, and the mount namespace can be used to operate on cgroups.
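A sketch of such a bind, with both paths purely illustrative, is:

\begin{minted}{c}
#include <stdio.h>
#include <sys/mount.h>

/* Expose one cgroup subtree from the parent mount namespace inside the
 * void's filesystem; the cgroup namespace itself stays voided. */
int expose_cgroup_subtree(void) {
    if (mount("/sys/fs/cgroup/app", "/void/sys/fs/cgroup",
              NULL, MS_BIND, NULL) == -1) {
        perror("bind cgroup subtree");
        return -1;
    }
    return 0;
}
\end{minted}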
\section{System design}
\label{sec:system-design}

At this point in the thesis, the theory of creating a void process (§\ref{chap:entering-the-void}) and the theory of filling a void with enough privilege to do useful work have been laid out. Now I present some more detail on the system that combines these into a useful aid to privilege separation.

The central contribution of void processes is the void orchestrator, a shim that uses an application binary and a textual specification to set up the multiple processes required for privilege separation. The specification describes a series of entrypoints, each of which contains three things: a trigger to create the process, a list of arguments, and extra elements for the environment. Example specifications are included in Chapter \ref{chap:building-apps}.

There are two types of entrypoints: those spawned statically at startup, and those spawned dynamically when triggered by an event. This event, as shown in the TLS server example (§\ref{sec:building-tls}), is most commonly sending one or more file descriptors from a different void process. File descriptors are the primary method of communication as they are not impacted by namespaces in any way - performance or isolation.

When a void process is spawned, the void orchestrator uses the specification to pass the pre-specified privilege into the void. A concept of spawners is also provided, which allows a void process to trigger the creation of another void process. Spawners are an additional process, existing in a void, which create further voids on demand. This allows for separation of privilege by handing work off to a differently privileged void when needed.

The void orchestrator handles the management of the void processes which make up an application, reducing the burden on the application programmer. A central knowledge bank for preparing a void serves as a single point to upgrade when new ambient authority is added and needs to be removed.

\section{Summary}
\chapter{Building Applications}
\label{chap:building-apps}

This chapter discusses the process of creating applications which utilise void processes. An application which requires only basic privilege is demonstrated (§\ref{sec:building-fib}), showing how to put together a simple application that takes advantage of void processes to start with minimum privilege. We conclude with a basic HTTP file server with TLS support, designed and built from the ground up for void processes (§\ref{sec:building-tls}).
\section{Fibonacci}
\label{sec:building-fib}

To begin demonstrating the power of the void orchestrator system, we will develop an application that requires very modest privilege. The application and its fixed output are shown, unmodified, in Listing \ref{lst:fibonacci-application}. It is written in Rust, my language of choice, but there is no such requirement - an equivalent program would look very similar in C. The limited code of this example makes the privilege requirements exceptionally clear. Computing \texttt{fib} requires no privilege at all, operating purely on numbers on the stack. Once the values are computed they are printed using the \texttt{println!} macro, which prints to stdout. Therefore the only privilege this application requires at runtime is access to stdout.
\begin{listing}
\begin{minted}{rust}
@ -861,14 +861,14 @@ To run this application as a void process we require a specification (§\ref{sec
\label{lst:fibonacci-application-spec}
\end{listing}
More of the advanced features of the system will be shown in the future examples, but this is enough to get a basic application up and running. We can see that the Rust application looks exactly like it would without the shim, at least for now. The application is also fully deprivileged. Of course, for an application as small as this example, we can verify by hand that the program has no foul effects. We can imagine a trivial extension that would make this program more dangerous: using a user argument (a privilege the program does not currently have) to take a value on which to execute fib. One way this user input could cause damage is with flawed usage of a logging library. The recent example of Log4j2 with CVE-2021-44228 springs to mind, enabling an attacker with string control to execute arbitrary code from the Internet. A void process with privilege of only arguments and stdout would protect well against this vulnerability, as not only is there no Internet access to pull remote code, but there is nothing to take advantage of in the process even if remote code execution is gained.
More of the advanced features of the system will be shown in the second example, but this is enough to get a basic application up and running. We can see that the Rust application looks exactly like it would without the shim. The application is also fully deprivileged. Of course, for an application as small as this example, we can verify by hand that the program has no ill effects. We can imagine a trivial extension that would make this program more dangerous: using a user argument (a privilege the program does not currently have) to take a value on which to compute \texttt{fib}. One way this user input could cause damage is through flawed usage of a logging library. A recent widespread example is Log4j2 with CVE-2021-44228, which enabled an attacker with control of a logged string to execute arbitrary code fetched from the Internet. A void process whose only privileges are its arguments and stdout would protect well against this vulnerability: not only is there no Internet access to pull remote code from, there is also nothing to take advantage of in the process even if remote code execution is gained.
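
To make the hypothetical concrete, such an extension might look like the sketch below. It is illustrative only: the argument handling is an assumption, and the argument privilege is deliberately absent from the current specification.

\begin{minted}{rust}
// Hypothetical extension: compute fib of a user-supplied argument. The
// argument privilege does not exist in the current specification.
fn fib(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        n => fib(n - 1) + fib(n - 2),
    }
}

fn main() {
    let n: u64 = std::env::args()
        .nth(1)
        .expect("expected one argument")
        .parse()
        .expect("argument must be a non-negative integer");

    // Even if this value reached a vulnerable logging library, a void
    // process with only arguments and stdout has no network from which to
    // fetch remote code and nothing else worth compromising.
    println!("fib({}) = {}", n, fib(n));
}
\end{minted}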
\subsection{Performance}
\label{sec:fib-performance}
In Section \ref{sec:void-creation-costs} testing showed that creating all of the namespaces needed for a void can have extremely high overhead compared to creating a simple new process. Now that a basic application exists to evaluate this on, the latency of the final shim executing an application can be tested.
Figure \ref{fig:fib-launch-times} shows the difference in spawning an application directly and spawning it with the shim (the Fibonacci application in this section can be launched either way). A C application with a tight for loop is compiled, which calls \texttt{vfork(2)} followed by \texttt{wait(2)}, again using high precision \texttt{CLOCK\_MONOTONIC} timings. The \texttt{vfork(2)} call calls \texttt{execv(2)} immediately, in the direct case with the Fibonacci binary itself, and in the shim case with the shim with the Fibonacci specification and binary as arguments.
Figure \ref{fig:fib-launch-times} shows the difference between spawning an application directly and spawning it with the shim (the Fibonacci application is compatible with both). The benchmark is a C program with a tight for loop which calls \texttt{vfork(2)} followed by \texttt{wait(2)}, again using high-precision \texttt{CLOCK\_MONOTONIC} timings. The child process calls \texttt{execv(2)} immediately: in the direct case with the Fibonacci binary itself, and in the shim case with the shim, passing the Fibonacci specification and binary as arguments.
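
An approximate Rust rendering of the measurement loop is sketched below for clarity. It is not the harness itself: the real benchmark is the C program described above and uses \texttt{vfork(2)} and \texttt{execv(2)} directly, while \texttt{Command::status} goes through the standard library's spawn path; \texttt{Instant} is backed by \texttt{CLOCK\_MONOTONIC} on Linux, and the binary paths and shim invocation here are placeholders.

\begin{minted}{rust}
use std::process::Command;
use std::time::Instant;

fn main() {
    // Placeholder paths: the direct case executes the Fibonacci binary,
    // the shim case executes the shim with the specification and binary.
    let direct = ["./fibonacci"];
    let shimmed = ["./void-orchestrator", "fibonacci.json", "./fibonacci"];

    for argv in [&direct[..], &shimmed[..]] {
        let mut samples = Vec::new();
        for _ in 0..1000 {
            let start = Instant::now(); // CLOCK_MONOTONIC on Linux
            let status = Command::new(argv[0])
                .args(&argv[1..])
                .status()
                .expect("failed to launch child");
            assert!(status.success());
            samples.push(start.elapsed());
        }
        samples.sort();
        println!("{:?}: median {:?}", argv, samples[samples.len() / 2]);
    }
}
\end{minted}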
\begin{figure}
\centering
@ -878,12 +878,12 @@ Figure \ref{fig:fib-launch-times} shows the difference in spawning an applicatio
\label{fig:fib-launch-times}
\end{figure}
The results demonstrate both significantly higher median latency and significantly higher variance in the results. The machine is the same as that mentioned in Section \ref{sec:void-creation-costs}. Although my virtual machine was not particularly busy at the time of testing, it appeared that the underlying host was, affecting times due to virtual machine context switching within the hypervisor. The underlying trend of a significant overhead remained throughout testing.
The results show that launching through the shim has both significantly higher median latency and significantly higher variance. The machine is the same as that mentioned in Section \ref{sec:void-creation-costs}. Although my virtual machine was not particularly busy at the time of testing, it appeared that the underlying host was, affecting times due to virtual machine context switching within the hypervisor. The underlying trend of a significant overhead remained throughout testing. This thesis focuses on user space, but future work in kernel space should be able to reduce this overhead (§\ref{sec:future-work-kernel-api}).
\section{TLS Server}
\section{TLS server}
\label{sec:building-tls}
Rather than presenting the complete applications as shown in the previous two sections, the TLS server presents instead a case study on designing applications from the ground up to run as void processes. The thought process behind data flow design and taking advantage of the more advanced void orchestrator features is given. This results in the process separation presented in Figure \ref{fig:tls-server-processes}. First we must accept TCP requests from the end user (§\ref{sec:building-tls-tcp-listener}). Then, to be able to check that all is working so far, we respond to these requests (§\ref{sec:building-tls-http-handler}). Finally, we add an encryption layer using TLS (\ref{sec:building-tls-tls-handler}). This results in a functional TLS file server with strong privilege separation, with each stage having no more privilege than it needs.
Rather than presenting a complete application as in the previous sections, the TLS server is instead a case study in designing an application from the ground up to run as void processes. The thought process behind the data flow design, and how to take advantage of the more advanced void orchestrator features, is given. This results in the process separation presented in Figure \ref{fig:tls-server-processes}. First we must accept TCP requests from the end user (§\ref{sec:building-tls-tcp-listener}). Then, to be able to check that all is working so far, we respond to these requests (§\ref{sec:building-tls-http-handler}). Finally, we add an encryption layer using TLS (§\ref{sec:building-tls-tls-handler}). The result is a functional TLS file server with strong privilege separation, with each stage having exactly the privilege that it requires.
\begin{figure}
\centering
@ -896,7 +896,7 @@ Rather than presenting the complete applications as shown in the previous two se
\subsection{TCP listener}
\label{sec:building-tls-tcp-listener}
The special privilege required by a process which accepts TCP connections is a listening TCP socket. As discussed in Section \ref{sec:filling-net}, TCP listening sockets are handed already bound to void processes. This enables a capability model for network access, otherwise restricting inbound and outbound networking entirely. The specification for this listener is given in Listing \ref{lst:tls-tcp-listener-spec}, where the TCP listener is requested as an argument already bound. No other permissions are required to accept connections from a TCP listener. Although the code at each stage is omitted for brevity, the resulting program has to parse the argument back into an integer and then a \texttt{TcpStream} before looping to receive incoming connections. When building and debugging software it is often useful to have access to the \texttt{stdout} or \texttt{stderr} streams, even though they won't be utilised in production. The void orchestrator provides useful \texttt{--stdout} and \texttt{--stderr} flags to temporarily privilege an application for debugging without modifying its specification. Of course, we can't do much useful with them without more privilege. Thus we move on to developing the HTTP handler.
The special privilege required by a process which accepts TCP connections is a listening TCP socket. As discussed in Section \ref{sec:filling-net}, TCP listening sockets are handed already bound to void processes. This enables a capability model for network access, otherwise restricting inbound and outbound networking entirely. The specification for this listener is given in Listing \ref{lst:tls-tcp-listener-spec}, where the TCP listener is requested as an argument already bound. No other permissions are required to accept connections from a TCP listener. Although the code at each stage is omitted for brevity, the resulting program has to parse the argument back into an integer and then a \texttt{TcpStream} before looping to receive incoming connections. When building and debugging software it is often useful to have access to the \texttt{stdout} or \texttt{stderr} streams, even though they won't be utilised in production. The void orchestrator provides useful \texttt{--stdout} and \texttt{--stderr} flags to temporarily privilege an application for debugging without modifying its specification. Having a process that can accept TCP streams but do nothing else is of little use, except for a simple hello-world or echo example. We move on to developing the HTTP handler to see useful output from the application.
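
To give an idea of how small this entrypoint is, a sketch of a listener loop is shown below. It is not the code from the example repository: the argument position is an assumption, and the descriptor is rebuilt here as a standard-library \texttt{TcpListener} for accepting connections.

\begin{minted}{rust}
use std::net::TcpListener;
use std::os::unix::io::FromRawFd;

fn main() {
    // Assumed convention: the shim passes the pre-bound, listening socket
    // as a file descriptor number in the first argument. In the real
    // binary this would be one entrypoint among several, selected by arg0.
    let fd: i32 = std::env::args()
        .nth(1)
        .expect("missing socket argument")
        .parse()
        .expect("socket argument is not an integer");

    // SAFETY: the orchestrator hands this process a descriptor that is
    // already bound and listening; no other privilege is required.
    let listener = unsafe { TcpListener::from_raw_fd(fd) };

    for stream in listener.incoming() {
        match stream {
            // Each accepted connection would be handed to the HTTP handler
            // via the trigger socket; omitted here.
            Ok(_stream) => {}
            // Only visible when debugging with --stderr.
            Err(e) => eprintln!("accept failed: {e}"),
        }
    }
}
\end{minted}

Run with \texttt{--stderr} during development, the \texttt{eprintln!} output becomes visible without granting the production specification any extra privilege.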
\begin{listing}
\begin{minted}{json}
@ -916,9 +916,9 @@ The special privilege required by a process which accepts TCP connections is a l
When attempting to add the HTTP handler, we immediately require more privilege. As this is intended to be a file server, we need some files. Although it would be easy to add files to the existing entrypoint, the principle of least privilege is highly encouraged when developing a void process. One should always ask whether an entrypoint really needs a new privilege before adding it, or whether the work would be better served by a new entrypoint.
In this case, we are going to add a new entrypoint for two reasons: multiprocessing and privilege separation. This allows the TCP listener entrypoint to continue in a tight loop, accepting requests very quickly and fanning them out to new processes. These new processes have only their required privileges: the files they wish to serve, and the \texttt{TcpStream} to serve them down. We take advantage here of another feature of the void orchestrator, file socket based triggers. These allow a statically defined socket to be setup which the void orchestrator will listen on and create new void processes on demand. Further, this ensures isolation between requests too, meaning that a single failed request that causes a process to fail will not affect any others, and a compromised process can't leak information about any other requests either.
In this case, we are going to add a new entrypoint for two reasons: multiprocessing and privilege separation. This allows the TCP listener entrypoint to continue in a tight loop, accepting requests very quickly and fanning them out to new processes. These new processes have only their required privileges: the files they wish to serve, and the \texttt{TcpStream} to serve them over. We take advantage here of another feature of the void orchestrator: file-socket-based triggers. These allow a statically defined socket to be set up, on which the void orchestrator listens and creates new void processes on demand. Further, this ensures isolation between requests, meaning that a single failed request which causes a process to fail will not affect any others, and a compromised process can't leak information about any other requests.
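
The mechanism underlying such a trigger is ordinary file descriptor passing over a Unix socket. The sketch below shows only that generic \texttt{SCM\_RIGHTS} machinery using the \texttt{libc} crate; the exact protocol spoken on the void orchestrator's trigger sockets is internal to it, so this is an illustration of the primitive rather than of its wire format.

\begin{minted}{rust}
use std::os::unix::io::{AsRawFd, RawFd};
use std::os::unix::net::UnixStream;

/// Pass `fd` down `sock` using SCM_RIGHTS. Error handling is elided; the
/// buffer layout follows the usual cmsg(3) recipe.
fn pass_fd(sock: &UnixStream, fd: RawFd) {
    unsafe {
        let mut byte = 0u8;
        let mut iov = libc::iovec {
            iov_base: &mut byte as *mut u8 as *mut libc::c_void,
            iov_len: 1,
        };

        // Space for one cmsghdr carrying a single file descriptor,
        // u64-aligned to satisfy cmsghdr alignment.
        let mut cmsg_buf = [0u64; 4];

        let mut msg: libc::msghdr = std::mem::zeroed();
        msg.msg_iov = &mut iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsg_buf.as_mut_ptr() as *mut libc::c_void;
        msg.msg_controllen =
            libc::CMSG_SPACE(std::mem::size_of::<libc::c_int>() as u32) as usize;

        let cmsg = libc::CMSG_FIRSTHDR(&msg);
        (*cmsg).cmsg_level = libc::SOL_SOCKET;
        (*cmsg).cmsg_type = libc::SCM_RIGHTS;
        (*cmsg).cmsg_len =
            libc::CMSG_LEN(std::mem::size_of::<libc::c_int>() as u32) as usize;
        std::ptr::write(libc::CMSG_DATA(cmsg) as *mut libc::c_int, fd);

        libc::sendmsg(sock.as_raw_fd(), &msg, 0);
    }
}
\end{minted}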
The HTTP handler entrypoint is added to the specification in Listing \ref{lst:tls-http-handler-spec}. As well as adding a single extra argument to trigger the HTTP handler, we must also add an entrypoint argument to differentiate between the two entrypoints. Much like the usage of \texttt{arg0} for symlinked binaries, we utilise \texttt{arg0} to find which intended use of the binary is being called.
The HTTP handler entrypoint is added to the specification in Listing \ref{lst:tls-http-handler-spec}. As well as adding a single extra argument to trigger the HTTP handler, we must also add an entrypoint argument to differentiate between the two entrypoints. Much like the usage of \texttt{arg0} for symlinked binaries, we utilise \texttt{arg0} to select which of the binary's entrypoints is being invoked (Listing \ref{lst:tls-main-function}). Having looked at this specification, we can imagine what the code for each section would look like. Tight privilege bounds act somewhat like type signatures in dictating the behaviour of an application. Future work involves taking more advantage of this link between the two (§\ref{sec:future-work-macros}).
\begin{listing}
\begin{minted}{json}
@ -1005,14 +1005,14 @@ The resulting specification is given in Listing \ref{lst:tls-spec}. The TLS hand
\label{lst:tls-spec}
\end{listing}
We now have a full specification for a TLS server. In this section I have focused entirely on building up the specification and not the code behind it. There are two reasons for this: the code has a lot of boilerplate argument processing, and a variety of code implementations are available. The boilerplate argument processing could be addressed with future work using features like proc macros in Rust which dynamically generate code based on the code that is already there (§\ref{sec:future-work-macros}). As for varying implementations, I chose to use the static library \texttt{rustls} to implement my TLS server. Perhaps someone else would prefer OpenSSL or LibreSSL, which is of course fine. For the HTTP part I use a random library I found on the Internet to parse HTTP headers before responding only to GET requests. Of course this approach is hugely error prone, but the separation of the HTTP handler from the sensitive TLS material and other parts of the filesystem increases my confidence. The implementation therefore matters very little in this analysis, but is made available at \ifsubmission \url{file:///SUBMITTED_SRC/void-orchestrator/examples/tls/} \else \url{https://github.com/JakeHillion/void-orchestrator/tree/main/examples/tls/} \fi.
We now have a full specification for a TLS server. In this section I have focused entirely on building up the specification and not the code behind it. There are two reasons for this: the code has a lot of boilerplate argument processing, and a variety of code implementations are available. The boilerplate argument processing could be addressed with future work using features like proc macros in Rust, which dynamically generate code based on input (§\ref{sec:future-work-macros}). As for varying implementations, I chose to use the static library \texttt{rustls} to implement my TLS server. There are many excellent alternatives, such as OpenSSL and LibreSSL, which would work just as well, though with different APIs. For HTTP I use a Rust crate for request parsing, before responding only to \texttt{GET} requests. The isolation from sensitive material adds a degree of freedom to request handling, as there is so little that can go wrong. The implementation therefore matters very little in this analysis, but is made available at \ifsubmission\url{file:///SUBMITTED_SRC/void-orchestrator/examples/tls/}\else\url{https://github.com/JakeHillion/void-orchestrator/tree/main/examples/tls/}\fi. Instead the focus is on composability and privilege separation, showing the flexibility of the system.
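
For a flavour of the handler code, a pared-down sketch of a \texttt{GET}-only responder over an already-established stream follows. It uses only the standard library rather than the parsing crate and \texttt{rustls} used in the real example, and the served filename is a placeholder; the point is how little such a process can reach even when the parsing goes wrong.

\begin{minted}{rust}
use std::io::{BufRead, BufReader, Write};
use std::net::TcpStream;

// Sketch of a GET-only responder. In the real example the stream arrives
// already decrypted from the TLS handler, and the only files visible are
// those mounted into this void.
fn handle(stream: TcpStream) -> std::io::Result<()> {
    let mut reader = BufReader::new(&stream);
    let mut request_line = String::new();
    reader.read_line(&mut request_line)?;

    let mut writer = &stream;
    if !request_line.starts_with("GET ") {
        return writer.write_all(b"HTTP/1.1 405 Method Not Allowed\r\n\r\n");
    }

    // Placeholder path: only files handed to this void are reachable at all.
    let body = std::fs::read("index.html")?;
    write!(
        writer,
        "HTTP/1.1 200 OK\r\nContent-Length: {}\r\n\r\n",
        body.len()
    )?;
    writer.write_all(&body)
}
\end{minted}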
\subsection{Performance}
\label{sec:tls-performance}
The performance of the TLS server provides a much more nuanced view than the performance of the trivial Fibonacci example (§\ref{sec:fib-performance}). I evaluated my void process solution against \texttt{apache2}, a highly tuned HTTP server. Both used the same self-signed TLS certificate and were benchmarked with the tool \texttt{ab}, the Apache HTTP server benchmarking tool. The machine setup remains the same as in Section \ref{sec:void-creation-costs}.
The performance of the TLS server provides a much more nuanced view than the performance of the trivial Fibonacci example (§\ref{sec:fib-performance}). I evaluated my void process solution against \texttt{apache2}, a highly tuned HTTP server. Both used the same self-signed TLS certificate and were benchmarked with \texttt{ab}, the Apache HTTP server benchmarking tool. The machine setup remains the same as in Section \ref{sec:void-creation-costs}. \texttt{apache2} runs the standard Ubuntu 22.04 LTS config with the TLS module enabled.
Figure \ref{fig:tls-performance} shows the absolute number of requests per second achieved over the ten second testing period, where \texttt{ab} created 100 concurrent connections and was allowed unlimited total requests. The graph plots the median result and the interquartile range. Figure \ref{fig:tls-relative-performance} shows the same data, but this time divided by the median result of the apache server's benchmark, allowing easier study of the relative performance of the two solutions at scale.
Figure \ref{fig:tls-performance} shows the number of requests per second achieved over a ten-second testing period, where \texttt{ab} created 100 concurrent connections and was allowed unlimited total requests. The graph plots the median result and the interquartile range. Figure \ref{fig:tls-relative-performance} shows the same data, but this time divided by the median result of the \texttt{apache2} server's benchmark, allowing easier analysis of the relative performance of the two solutions.
\begin{figure}
\centering
@ -1026,15 +1026,15 @@ Figure \ref{fig:tls-performance} shows the absolute number of requests per secon
\centering
\includegraphics[width=0.8\textwidth]{graphs/tls_relative_performance.png}
\caption{\texttt{a2bench} requests per second results over 10 seconds with 100 simultaneous requests on varying response sizes. As the response size increases, the gap between the \texttt{apache2} TLS web server and the void process TLS web server decreases.}
\caption{\texttt{ab} requests per second results over 10 seconds with 100 simultaneous requests on varying response sizes. As the response size increases, the gap between the \texttt{apache2} TLS web server and the void process TLS web server decreases. Results are scaled to the \texttt{apache2} medians, and \texttt{apache2} is shown without error bars as it is always 100\% of itself.}
\label{fig:tls-relative-performance}
\end{figure}
The trend of the data is that the void process solution handles large request sizes better, while apache handles small request sizes much better. This relates to the high cost of process creation (§\ref{sec:void-creation-costs},§\ref{sec:fib-performance}) associated with void processes, given that two processes need be created for each request (a \texttt{tls\_handler} and a \texttt{http\_handler}). However, combined with large response processes (meaning long lived processes) and high concurrency, the void process server eventually outperforms apache. This highlights the effectiveness of using ordinary file descriptors, even across namespace boundaries, as speed is well maintained.
The trend of the data is that the void process solution handles large request sizes better, while \texttt{apache2} handles small request sizes significantly better. This relates to the high cost of process creation (§\ref{sec:void-creation-costs}, §\ref{sec:fib-performance}) associated with void processes, given that two processes must be created for each request (a \texttt{tls\_handler} and an \texttt{http\_handler}). However, with large responses (and therefore longer-lived processes) and high concurrency, the void process server eventually outperforms \texttt{apache2}. This highlights the effectiveness of using ordinary file descriptors, as performance is maintained even with maximum privilege separation.
\section{Summary}
While avoiding looking at the internals, I've demonstrated how void processes can both run a standard process with no privilege requirements and define a structure for a new application. Explicit definitions of privilege can make it very clear to the programmer where privilege boundaries are, leading to effective privilege separation. The performance changes caused by these designs have been evaluated, where the use of standard file descriptors as capabilities shows that utilising the void orchestrator can achieve acceptable performance with minimal programming effort.
I've demonstrated how void processes can both run a standard process with no privilege requirements and define the structure for a new application. Explicit definitions of privilege can make it very clear to the programmer where privilege boundaries are, leading to effective privilege separation. The performance impact of these designs has been evaluated, and the use of standard file descriptors as capabilities shows that utilising the void orchestrator can achieve acceptable performance with minimal programming effort.
\chapter{Conclusions}
@ -1042,9 +1042,9 @@ While avoiding looking at the internals, I've demonstrated how void processes ca
The system built in this project enables running applications with minimal privilege in a Linux environment in a novel way. Performance is shown to be comparable, and the evaluation demonstrates where the existing kernel setup provides inadequate performance for such applications. Design choices in the kernel's user-space APIs for namespaces are discussed and contextualised, with suggestions offered for alternative designs.
Void processes offer a new paradigm for application development which prioritises limitation of privilege. Rather than focusing on limiting backward compatibility, applications often need to be completely rewritten in order to take advantage of improved isolation. The system is designed to support effective static analysis on applications, though this is not implemented at this stage. I present in this work that privilege through explicit choices is a simpler paradigm for programmers than fighting against the moving target of Linux ambient privilege.
Void processes offer a new paradigm for application development which prioritises minimal privilege. Applications often need to be completely rewritten in order to take advantage of improved isolation, rather than being limited by backwards compatibility guarantees. The system is designed to support effective static analysis of applications, though this is not implemented at this stage. I argue in this work that privilege through explicit choices is a simpler paradigm for programmers than fighting against the moving target of Linux ambient privilege.
Finally, void processes provide a seamless experience without making kernel level changes, allowing for ease of deployment. Moreover, it runs on the Linux kernel, a production kernel and not a research kernel. Although the current kernel structure limits the performance of the work with namespace creation being the bottleneck, the feasibility of namespaces for process isolation is effectively demonstrated in a system that encourages application writers to develop with privilege separation as a first principle.
Finally, void processes provide a seamless experience without making kernel-space changes, allowing for ease of deployment. The system runs on the Linux kernel, a production kernel with widespread usage. Although the current kernel structure limits performance, with namespace creation as the bottleneck, the feasibility of namespaces for process isolation is effectively demonstrated in a system that encourages application writers to develop with privilege separation as a first principle.
\section{Future work}
\label{sec:future-work}
@ -1052,7 +1052,7 @@ Finally, void processes provide a seamless experience without making kernel leve
\subsection{Kernel API improvements}
\label{sec:future-work-kernel-api}
The primary future work to increase the utility of void processes is better performance when creating empty namespaces. Sections \ref{sec:void-creation-costs} and \ref{fig:fib-launch-times} showed that the startup hit when creating the namespaces for a void is very high. This shows a limitation of the APIs, as creating a namespace that has no relation to a parent should involve a small amount of work. Secondly, an API similar to network namespaces adding paired interfaces between namespaces should be added for binding in mount namespaces, allowing mount namespaces to also be created completely empty. This would also benefit containers which by default have no connection to the parent namespace, but need to mount in their own root filesystem.
The primary future work to increase the utility of void processes is better performance when creating empty namespaces. Sections \ref{sec:void-creation-costs} and \ref{sec:fib-performance} showed that the startup hit when creating the namespaces for a void is very high. This shows a limitation of the APIs, as creating a namespace that has no relation to a parent should involve only a small amount of work. Secondly, an API for binding mounts between mount namespaces, similar to the paired interfaces that network namespaces provide, should be added, allowing mount namespaces to be created completely empty. This would also benefit containers, which by default have no connection to the parent namespace but need to mount in their own root filesystem.
\subsection{Dynamic linking}
\label{sec:future-work-dynamic-linking}
@ -1064,10 +1064,6 @@ Dynamic linking works correctly under the shim, however, it currently requires a
Much of the information given in the specification and the code is shared. For example, the specification may list the arguments and also imply their types. This means that a function signature for an entrypoint implies almost all of that entrypoint's specification, which would allow effective code generation with some supplementary information. This would remove many of the boilerplate argument-processing lines from the examples and increase the usability of the system. Combining this with the dynamic linking work (§\ref{sec:future-work-dynamic-linking}) would remove a huge amount of the manual effort in creating the specification, making the system more user-friendly.
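
To illustrate the duplication, the sketch below shows an entrypoint whose signature already encodes what the specification has to repeat. The names are illustrative rather than taken from the example repository.

\begin{minted}{rust}
use std::net::TcpStream;
use std::path::PathBuf;

// This signature and the JSON specification currently describe the same
// information twice; a future procedural macro could derive one from the
// other. The names are illustrative, not the real example's.
fn http_handler(stream: TcpStream, served: PathBuf) {
    // ... parse the request from `stream` and respond from `served` ...
    let _ = (stream, served);
}
\end{minted}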
\subsection{Dynamic requests}
A system for dynamically requesting statically specified network sockets was presented (§\ref{sec:filling-net}). This system of requests back to the shim could be extended to more dynamic behaviour for software that requires it. Some software, particularly that which interfaces with the user, is not able to statically specify its requirements before starting. By instead specifying a range of legal requests and then making them dynamically, void processes would be able to support more software.
\label{lastcontentpage} % end page count here
%TC:ignore % end word count here