mirror of
https://git.overleaf.com/6227c8e96fcdc06e56454f24
synced 2024-12-22 21:43:05 +00:00
Update on Overleaf.
This commit is contained in:
parent
1078e34d73
commit
e5781773cc
@ -198,9 +198,17 @@ Void Processes take advantage of modern Linux namespaces to run applications wit
|
||||
This work explores the question of what is an operating system by taking a novel approach to running applications with the system exposed in a very different way. Rather than limiting the access of a process or set of processes to the operating system, such as in containers, we instead limit the access to the operating system with more explicit methods per process. Interaction between processes is allowed by specifying such interaction statically at compile time, removing any separation between the application developer and the system controlling access to the application, unlike solutions such as SELinux \citep{loscocco_security-enhanced_2000}.
|
||||
\fi
|
||||
|
||||
The question of what makes an operating system has been asked many times. This work looks for an answer by running applications in a very different way.
|
||||
The question of what makes an operating system has been asked many times. This work looks for an answer by running applications in a very different way. There have previously been many attempts to redefine an operating system. Placing this work alongside two of those, unikernels and containers, gives a spectrum. Unikernels abandon the monolithic kernel in favour of a slimmed-down kernel that provides only the features the user needs, limiting the trusted computing base but requiring special-purpose applications to be written. Containers provide a view of an isolated system while sharing the monolithic kernel with the host, allowing almost any application that can run on Linux to run in a Linux container, but including all of the features and security holes that come with a monolithic kernel. Void Processes lie between the two. While they still rely on the monolithic kernel for isolation and inter-process communication, further reliance on the kernel is limited as much as possible, reducing the attack surface. Much of the Linux experience is made unavailable, but the core calls, such as operations on file descriptors, remain the same. Because nothing is available by default, every required feature must be added explicitly, as with unikernels. Unlike unikernels, Void Processes can run nearly anything supported in a Linux environment with only minor code changes.
|
||||
|
||||
\todo{Comparison to unikernels.}
|
||||
\begin{itemize}
|
||||
\item Unikernels
|
||||
\item Void Processes
|
||||
\item Containers
|
||||
\item Virtual Machines
|
||||
\item Bare Metal Linux
|
||||
\end{itemize}
|
||||
|
||||
\todo{Convert this list to a figure.}
|
||||
|
||||
|
||||
\begin{table*}
|
||||
@ -310,7 +318,7 @@ Although PID namespaces work quite well for creating a Void Process from the per
|
||||
|
||||
The first behaviour shown is that an \texttt{unshare(CLONE\_NEWPID)} call followed immediately by an \texttt{exec} does not have the desired effect. The reason for this is that the first process created in the new namespace is given PID 1 and acts as an init process. That is, whichever process the shell spawns first becomes the init process of the namespace, and when that process dies, the namespace can no longer create new processes. This behaviour is avoided by either calling \texttt{unshare(2)} followed by \texttt{fork(2)}, or utilising \texttt{clone(2)} instead. The \texttt{unshare(1)} binary provides a \texttt{-{}-fork} flag to solve this, while the implementation of the Void Orchestrator uses \texttt{clone(2)}, which combines the two into a single syscall.
|
||||
|
||||
Secondly, we see that even in a shell that appears to be working correctly, processes from outside of the new PID namespace are still visible. This behaviour occurs because the mount of \texttt{/proc} visible to the process in the new PID namespace is the same as that of the init process. This is solved by remounting \texttt{/proc}, available in \texttt{unshare(1)} with the \texttt{-{}-mount-proc} flag. Care must be taken that this mount is completed in a new mount namespace, or else processes outside of the PID namespace will be affected. The Void Orchestrator again avoids this by voiding the mount namespace entirely, so any access to \texttt{/proc} must be either bound in from outside the namespace, or freshly mounted, allowing either required behaviour.
|
||||
Secondly, we see that even in a shell that appears to be working correctly, processes from outside of the new PID namespace are still visible. This behaviour occurs because the mount of \texttt{/proc} visible to the process in the new PID namespace is the same as that of the init process. This is solved by remounting \texttt{/proc}, available in \texttt{unshare(1)} with the \texttt{-{}-mount-proc} flag. Care must be taken that this mount is completed in a new mount namespace, or else processes outside of the PID namespace will be affected. The Void Orchestrator again avoids this by voiding the mount namespace entirely, so any access to \texttt{/proc} must be either deliberately bound in from outside the namespace or freshly mounted.
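The two fixes described above (fork after unsharing, then remount \texttt{/proc} inside a new mount namespace) can be sketched as follows. This is a minimal illustration written for this section, not the Void Orchestrator's implementation, which uses \texttt{clone(2)}. The helper names \texttt{load\_libc}, \texttt{check} and \texttt{run\_as\_namespace\_init} are hypothetical, the flag values are assumed to be those of current Linux headers, and the real calls require root privileges.

```python
import ctypes
import os

# Flag values as found in <sched.h> and <sys/mount.h> on Linux (assumed).
CLONE_NEWNS = 0x00020000
CLONE_NEWPID = 0x20000000
MS_REC = 0x4000
MS_PRIVATE = 1 << 18


def load_libc():
    """Load the C library and declare the argument types of mount(2)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    return libc


def check(ret, what):
    """Raise OSError if a libc call reported failure."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def run_as_namespace_init(child_fn, libc):
    """Run child_fn as PID 1 of a new PID namespace; return its exit code."""
    # unshare(2) places only *future children* in the new PID namespace,
    # so a fork must follow before anything is run.
    check(libc.unshare(CLONE_NEWPID | CLONE_NEWNS), "unshare")
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            # Stop mount events propagating back to the parent namespace.
            check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None),
                  "make root private")
            # A fresh procfs shows only processes in the new PID namespace.
            check(libc.mount(b"proc", b"/proc", b"proc", 0, None),
                  "mount /proc")
            code = int(child_fn() or 0)
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)
```

As root, \texttt{run\_as\_namespace\_init(lambda: os.execvp("ps", ["ps", "ax"]), load\_libc())} would be one way to inspect the result. Note that the caller itself also moves into the new mount namespace, because \texttt{CLONE\_NEWNS} applies to the calling process immediately.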
|
||||
|
||||
\lstset{caption={Unshare behaviour with PID namespaces, with and without forking and remounting proc.}}
|
||||
\begin{lstlisting}[float,label={lst:unshare-pid}]
|
||||
@ -343,7 +351,7 @@ Mount namespaces were by far the most challenging part of this project. When add
|
||||
|
||||
\subsubsection{Copy-on-Write}
|
||||
|
||||
Comparing mount namespaces to network namespaces, a slightly more modern namespace [Table \ref{tab:namespaces}], we see a large difference in what occurs when a new namespace is created. When creating a new network namespace, the ideal conditions for a Void Process are created: a network namespace containing only a loopback adapter. That is, the process has no ability to interact with the outside network, and no immediate relation to the parent network namespace. To interact with other namespaces, one must explicitly create a connection between the two, or move a physical adapter into the new (empty) namespace. Mount namespaces, rather than creating a new and empty namespace, made the choice to create a copy of the parent namespace, in a copy-on-write fashion. That is, after creating a new mount namespace, the mount hierarchy appears much the same as before. This is shown in Listing \ref{lst:unshare-cat-passwd}, where the file \texttt{/etc/passwd} is shown before and after an unshare, revealing the same content.
|
||||
Comparing mount namespaces to network namespaces, we see a large difference in what occurs when a new namespace is created. When creating a new network namespace, the ideal conditions for a Void Process are created: a network namespace containing only a loopback adapter. That is, the process has no ability to interact with the outside network, and no immediate relation to the parent network namespace. To interact with other namespaces, one must explicitly create a connection between the two, or move a physical adapter into the new (empty) namespace. Mount namespaces, rather than creating a new and empty namespace, made the choice to create a copy of the parent namespace, in a copy-on-write fashion. That is, after creating a new mount namespace, the mount hierarchy appears much the same as before. This is shown in Listing \ref{lst:unshare-cat-passwd}, where the file \texttt{/etc/passwd} is shown before and after an unshare, revealing the same content.
|
||||
|
||||
\lstset{caption={Reading the same file before and after unsharing the mount namespace.}}
|
||||
\begin{lstlisting}[float,label={lst:unshare-cat-passwd}]
|
||||
@ -382,7 +390,6 @@ sys:x:3:3:sys:/dev:/usr/sbin/nologin
|
||||
\end{lstlisting}
|
||||
|
||||
\subsubsection{Shared Subtrees}
|
||||
\label{sec:shared-subtrees}
|
||||
|
||||
While some other namespaces are copy-on-write, for example UTS namespaces, they do not present the same problem as mount namespaces. Although UTS namespaces are copy-on-write, it is trivial to create the conditions for a Void Process by setting the hostname of the machine to a constant. This removes any relation to the parent namespace and to the outside machine. Mount namespaces instead maintain a shared pointer with most filesystems, more akin to not creating a new namespace than a copy-on-write namespace.
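Voiding a UTS namespace as described above amounts to two calls, sketched below. This is an illustration only: the function name \texttt{void\_uts\_namespace} and the constant hostname \texttt{void} are hypothetical choices, the flag value is assumed from current Linux headers, and the real calls need the appropriate privileges.

```python
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # from <sched.h> on Linux (assumed value)


def load_libc():
    """Load the C library and declare the argument types of sethostname(2)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.sethostname.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
    return libc


def check(ret, what):
    """Raise OSError if a libc call reported failure."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def void_uts_namespace(libc, hostname=b"void"):
    """Enter a new UTS namespace and overwrite the copied hostname."""
    # The new namespace starts as a copy of the parent's hostname...
    check(libc.unshare(CLONE_NEWUTS), "unshare")
    # ...so replace it with a constant that reveals nothing about the host.
    check(libc.sethostname(hostname, len(hostname)), "sethostname")
```

The hostname change is confined to the new namespace, so the parent's hostname is unaffected.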
|
||||
|
||||
@ -391,6 +398,7 @@ Shared subtrees \citep{pai_shared_2005} were introduced to provide a consistent
|
||||
\begin{figure*}
|
||||
\begin{minipage}{.45\textwidth}
|
||||
|
||||
\lstset{caption={}}
|
||||
\begin{lstlisting}[frame=tlrb,showlines=true]{Name}
|
||||
# unshare -m
|
||||
# mount_container_root /tmp/a
|
||||
@ -406,6 +414,7 @@ Shared subtrees \citep{pai_shared_2005} were introduced to provide a consistent
|
||||
\end{minipage}\hfill
|
||||
\begin{minipage}{.45\textwidth}
|
||||
|
||||
\lstset{caption={}}
|
||||
\begin{lstlisting}[frame=tlrb]{Name}
|
||||
#
|
||||
#
|
||||
@ -426,11 +435,11 @@ file_1 file_2
|
||||
|
||||
\say{Notwithstanding the fact that the default propagation type for new mount is in many cases \texttt{MS\_PRIVATE}, \texttt{MS\_SHARED} is typically more useful. For this reason, \texttt{systemd(1)} automatically remounts all mounts as \texttt{MS\_SHARED} on system startup. Thus, on most modern systems, the default propagation type is in practice \texttt{MS\_SHARED}.}
|
||||
|
||||
This means that when creating a new namespace, mounts and unmounts are propagated by default. Further, it means that mounts and unmounts are propagated out of the namespace. This can be highly confusing behaviour, and \texttt{unshare(1)} considers this behaviour inconsistent with the goals of unsharing - it immediately calls \texttt{mount("none", "/", NULL, MS\_REC|MS\_PRIVATE, NULL)} after \texttt{unshare(CLONE\_NEWNS)}, detaching the new unshared tree. The reasoning for this is that containers created should not present the behaviour given in Figure \ref{fig:shared-subtrees}, and this behaviour is unavoidable unless the parent mounts are shared, while it is possible to disable the behaviour where necessary.
|
||||
This means that when creating a new namespace, mounts and unmounts are propagated by default. More specifically, it means that mounts and unmounts are propagated both from the parent namespace to the child, and from the child namespace to the parent. This can be highly confusing behaviour, as it provides minimal isolation by default. \texttt{unshare(1)} considers this behaviour inconsistent with the goals of unsharing - it immediately calls \texttt{mount("none", "/", NULL, MS\_REC|MS\_PRIVATE, NULL)} after \texttt{unshare(CLONE\_NEWNS)}, detaching the newly unshared tree. The reasoning for enabling \texttt{MS\_SHARED} by default is that containers created should not present the behaviour given in Figure \ref{fig:shared-subtrees}, and this behaviour is unavoidable unless the parent mounts are shared, while it is possible to disable the behaviour where necessary.
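Whether a given mount is shared can be checked from the optional fields of \texttt{/proc/self/mountinfo}, where shared mounts carry a \texttt{shared:N} tag and private mounts carry no tag. The sketch below is an illustration added for this section; the function names are hypothetical and the sample line in the usage note is invented.

```python
def propagation_tags(mountinfo_line):
    """Return the propagation tags on one line of /proc/self/mountinfo.

    The optional fields sit between the sixth field (mount options) and
    the "-" separator. Tags include "shared", "master", "propagate_from"
    and "unbindable"; a private mount has none.
    """
    fields = mountinfo_line.split()
    separator = fields.index("-", 6)
    return {tag.split(":")[0] for tag in fields[6:separator]}


def is_shared(mountinfo_line):
    """True if mount events on this mount propagate to its peer group."""
    return "shared" in propagation_tags(mountinfo_line)


def shared_mount_points(path="/proc/self/mountinfo"):
    """List the mount points in this namespace that are currently shared."""
    with open(path) as mountinfo:
        return [line.split()[4] for line in mountinfo if is_shared(line)]
```

For an invented line such as \texttt{36 35 98:0 / /mnt rw shared:1 - ext3 /dev/root rw}, \texttt{is\_shared} returns \texttt{True}. Running \texttt{shared\_mount\_points()} before and after the \texttt{MS\_REC|MS\_PRIVATE} remount is one way to confirm that the tree has been detached.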
|
||||
|
||||
\subsubsection{Lazy unmounting}
|
||||
|
||||
Mount namespaces present further interesting behaviour when unmounting the initial root filesystem. Although this may initially seem isolated to Void Processes, it is also a problem in a container type system. Consider again the container created in Figure \ref{fig:shared-subtrees} - the existing root must be unmounted after pivoting, to avoid keeping the container fully connected to the outside root.
|
||||
Mount namespaces present further interesting behaviour when unmounting the old root filesystem. Although this may initially seem isolated to Void Processes, it is also a problem in a container system. Consider again the container created in Figure \ref{fig:shared-subtrees}: the existing root must be unmounted after pivoting, else the container remains fully connected to the outside root.
|
||||
|
||||
Referring again to network namespaces, sockets continue to exist in their initial namespace, allowing for regular file-descriptor passing semantics \citep{biederman_re_2007}. Extending upon this socket behaviour is WireGuard, which creates adapters that may be freely moved between namespaces while continuing to connect externally from their initial parent \citep[§7.3]{donenfeld_wireguard_2017}.
|
||||
|
||||
@ -451,7 +460,7 @@ if (umount("/"))
|
||||
umount: Device or resource busy
|
||||
\end{lstlisting}
|
||||
|
||||
A feature called lazy unmounting or \texttt{MNT\_DETACH} exists for situations where a busy mount still needs to be unmounted. Supplying the \texttt{MNT\_DETACH} flag to \texttt{umount2(2)} causes the mount to be immediately detached from the unified hierarchy, while remaining mounted internally until the last user has finished with it. While this initially seems like a good solution, this syscall is incredibly dangerous when combined with shared subtrees. This behaviour is shown in Figure \ref{fig:unshare-umount-lazy}, where a lazy (and hence recursive) unmount is combined with a shared subtree to disastrous effect.
|
||||
A feature called lazy unmounting or \texttt{MNT\_DETACH} exists for situations where a busy mount still needs to be unmounted. Supplying the \texttt{MNT\_DETACH} flag to \texttt{umount2(2)} causes the mount to be immediately detached from the unified hierarchy, while remaining mounted internally until the last user has finished with it. Whilst this initially seems like a good solution, this syscall is incredibly dangerous when combined with shared subtrees. This behaviour is shown in Figure \ref{fig:unshare-umount-lazy}, where a lazy (and hence recursive) unmount is combined with a shared subtree to disastrous effect.
|
||||
|
||||
\begin{figure*}
|
||||
\begin{minipage}{.45\textwidth}
|
||||
@ -495,11 +504,11 @@ doesn't exist).}
|
||||
|
||||
This logic held even in the presence of namespaces, with the initial patchset in February 2001 \citep{viro_patch_2001}, as mounts were not initially shared but duplicated between namespaces. However, when shared subtrees were added in January 2005 \citep{viro_rfc_2005}, this logic stopped holding.
|
||||
|
||||
When setting up a container environment, one calls \texttt{pivot\_root(2)} to replace the old root with a new root for the container. Then, the old root may be unmounted. Oftentimes the solution is to exec a binary in the new root first, meaning that the old root is no longer in use and may be unmounted. This works, as the old root is only a reference in this namespace, and hence may be unmounted along with its children; the \texttt{vfsmount} in this namespace is not busy, in contradiction to the quotation.
|
||||
When setting up a container environment, one calls \texttt{pivot\_root(2)} to replace the old root with a new root for the container. Then, the old root may be unmounted. Oftentimes the solution is to exec a binary in the new root first, meaning that the old root is no longer in use and may be unmounted. This works, as the old root is only a reference in this namespace, and hence may be unmounted along with its children; the \texttt{vfsmount} in this namespace is not busy, contradicting an assertion in the quotation.
|
||||
|
||||
If, instead, one wishes to continue running the existing binary, this is possible with lazy unmounting. However, the kernel only exposes a recursive lazy unmount. With shared subtrees, this results in destroying the parent tree. While this is avoidable by removing the shared propagation from the subtree before unmounting, the choice to have \texttt{MNT\_DETACH} aggressively cross shared subtrees can be highly confusing, and perhaps undesired behaviour in a world with shared subtrees by default.
|
||||
|
||||
Mount namespaces were the first [Table \ref{tab:namespaces}] namespaces introduced to Linux, in kernel version 2.5.2 \citep{torvalds_linux_2002}. In contrast to network namespaces, the API is particularly unfriendly to creating a Void Process. The creation of mount namespaces is copy-on-write, and many filesystems are mounted shared. This means that they propagate changes back through namespace boundaries. As the mount namespace does not allow for creating an entirely empty root, extra care must be taken in separating processes. The method taken in this system is mounting a new \texttt{tmpfs} file system in a new namespace, which doesn't propagate to the parent, and using the \texttt{pivot\_root(8)} command to make this the new root. By pivoting to the \texttt{tmpfs}, the old root exists as the only reference in the otherwise empty \texttt{tmpfs}. Finally, after ensuring the old root is set to \texttt{MS\_PRIVATE} to avoid propagation (more details in §\ref{sec:shared-subtrees}), the old root can be lazily detached. This allows the binary from the parent namespace, the shim in this case, to continue running correctly. Any new processes only have access to the materials in the empty \texttt{tmpfs}. This new \texttt{tmpfs} never appears in the parent namespace, separating the Void Process effectively from the parent namespace.
|
||||
The API is particularly unfriendly to creating a Void Process. The creation of mount namespaces is copy-on-write, and many filesystems are mounted shared. This means that they propagate changes back through namespace boundaries. As the mount namespace does not allow for creating an entirely empty root, extra care must be taken in separating processes. The method taken in this system is mounting a new \texttt{tmpfs} file system in a new namespace, which doesn't propagate to the parent, and using the \texttt{pivot\_root(8)} command to make this the new root. By pivoting to the \texttt{tmpfs}, the old root exists as the only reference in the otherwise empty \texttt{tmpfs}. Finally, after ensuring the old root is set to \texttt{MS\_PRIVATE} to avoid propagation, the old root can be lazily detached. This allows the binary from the parent namespace, the shim in this case, to continue running correctly. Any new processes only have access to the materials in the empty \texttt{tmpfs}. This new \texttt{tmpfs} never appears in the parent namespace, separating the Void Process effectively from the parent namespace.
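The sequence described above can be sketched with direct syscalls. This is a hedged illustration rather than the system's actual code: it uses the \texttt{pivot\_root(2)} syscall instead of the \texttt{pivot\_root(8)} command, and it takes the conservative route of marking the whole tree private before mounting anything. The names \texttt{enter\_void\_root} and \texttt{old}, and the syscall numbers in \texttt{SYS\_PIVOT\_ROOT}, are assumptions made for the example; \texttt{new\_root} must be an existing directory and the real calls require root.

```python
import ctypes
import os

# Values as found in current Linux headers (assumed).
CLONE_NEWNS = 0x00020000
MS_REC = 0x4000
MS_PRIVATE = 1 << 18
MNT_DETACH = 2
# glibc has no pivot_root wrapper; these syscall numbers are assumptions.
SYS_PIVOT_ROOT = {"x86_64": 155, "aarch64": 41}


def load_libc():
    """Load the C library and declare the argument types used below."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    libc.umount2.argtypes = [ctypes.c_char_p, ctypes.c_int]
    libc.mkdir.argtypes = [ctypes.c_char_p, ctypes.c_uint]
    libc.syscall.argtypes = [ctypes.c_long, ctypes.c_char_p, ctypes.c_char_p]
    libc.syscall.restype = ctypes.c_long
    return libc


def check(ret, what):
    """Raise OSError if a libc call reported failure."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def enter_void_root(libc, new_root, pivot_root_nr):
    """Replace the root of a new mount namespace with an empty tmpfs."""
    put_old = new_root + b"/old"
    check(libc.unshare(CLONE_NEWNS), "unshare")
    # Remove shared propagation first, so that neither the tmpfs mount nor
    # the lazy unmount below can reach the parent namespace.
    check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None),
          "make root private")
    check(libc.mount(b"tmpfs", new_root, b"tmpfs", 0, None), "mount tmpfs")
    check(libc.mkdir(put_old, 0o700), "mkdir")
    # The old root becomes the only entry in the otherwise empty tmpfs.
    check(libc.syscall(pivot_root_nr, new_root, put_old), "pivot_root")
    check(libc.chdir(b"/"), "chdir")
    # Lazily detach the old root; the running binary keeps working.
    check(libc.umount2(b"/old", MNT_DETACH), "umount2")
    check(libc.rmdir(b"/old"), "rmdir")
```

A caller on x86-64 might invoke \texttt{enter\_void\_root(load\_libc(), b"/tmp/void", SYS\_PIVOT\_ROOT["x86\_64"])}. Passing the libc handle as a parameter keeps the ordering of calls easy to check with a stand-in object.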
|
||||
|
||||
\subsection{User Namespaces}
|
||||
\label{sec:voiding-user}
|
||||
|