[PATCH -0/1] Enhancing VFS isolation

李志 <lizhi16@xxxxxxxxxxx> · Sun, 25 Jun 2023 16:06:25 +0800 (GMT+08:00)

Dear all,

Over the past six years, a class of path misresolution risks has persisted in nearly all container tools (e.g., Docker, Podman, and Kubernetes) and is responsible for nearly half of the 27 high-severity vulnerabilities in these tools. Specifically, while interacting with a container's filesystem, a container tool running with the host's root privilege can be induced to execute an illegitimate file inside a malicious container (e.g., CVE-2019-14271) or tricked into resolving a malicious symlink that belongs to the container to a location outside the container (e.g., CVE-2017-1002101). The problem stems from today's “one-way” isolation of the in-container filesystem: although the host's resources outside the container filesystem are invisible to the containerized application, host executables (including the container tool and the components it depends on) face no constraints when accessing the in-container filesystem.
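
To make the symlink case concrete, here is a minimal userland sketch in C. It is not taken from any container tool; the helper names `path_is_under` and `resolves_outside` are hypothetical. It resolves a name under a container rootfs from the host's point of view and reports whether the result lands outside that rootfs.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* True if `path` equals `root` or lies beneath it (whole components only). */
bool path_is_under(const char *root, const char *path)
{
    size_t n = strlen(root);

    if (strcmp(root, "/") == 0)
        return path[0] == '/';
    if (strncmp(root, path, n) != 0)
        return false;
    return path[n] == '\0' || path[n] == '/';
}

/*
 * Hypothetical helper: resolve `name` under `rootfs` the way a host-side
 * process would (following symlinks in the host's view).
 * Returns 1 if the result escapes the rootfs, 0 if it stays inside,
 * -1 on error (e.g. the path does not exist).
 */
int resolves_outside(const char *rootfs, const char *name)
{
    char root_real[PATH_MAX], joined[PATH_MAX], target[PATH_MAX];
    int n;

    assert(rootfs != NULL && name != NULL);
    if (!realpath(rootfs, root_real))
        return -1;
    n = snprintf(joined, sizeof joined, "%s/%s", root_real, name);
    if (n < 0 || (size_t)n >= sizeof joined)
        return -1;
    if (!realpath(joined, target))
        return -1;
    return path_is_under(root_real, target) ? 0 : 1;
}
```

For a rootfs that contains a symlink `evil -> /`, `resolves_outside(rootfs, "evil")` returns 1, while a regular file inside the rootfs yields 0. Note that a check like this is itself racy: the container can swap the path between the check and the later open.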

We find that this security risk cannot be effectively controlled in userland by the container tools themselves. The existing vulnerabilities show that third-party components called by the container tools often break this kind of control unintentionally. Kernel-based filesystem isolation is therefore the only viable way to comprehensively mediate filesystem accesses from all kinds of third-party components and to ensure that isolation is always in place during host-container interactions.

For now, the mount namespace and pivot_root are responsible for filesystem isolation for containers. The mount namespace only segregates mount points between the container and the host, and is combined with pivot_root to confine the container application's view to a given path. As a result, from the perspective of the virtual filesystem (VFS), the kernel cannot tell whether a directory entry (dentry) object belongs to a container or not. This makes any illegal access to the filesystem hard to identify, as long as the path in the access request can be translated into a dentry object through the VFS.

We therefore propose extending filesystem isolation to dentry objects, ensuring full mediation of host-container filesystem interactions. For this purpose, we patch some functions in the VFS implementation. First, we extend the ‘pivot_root’ syscall (fs/namespace.c) to tag dentries according to their relation to containers. Second, we modify the path lookup process in the VFS and enforce a set of carefully designed policies in ‘complete_walk()’ (fs/namei.c) to regulate access to these objects.
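
The tag-then-check idea can be modeled in plain userspace C as below. This is only an illustrative sketch and not the patch itself: `struct model_dentry`, `tag_subtree()`, `walk_allowed()`, the `explicit_entry` flag, and the policy shown are all assumptions made for the example, and none of them are kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HOST_ID 0u /* assumption: tag 0 means "belongs to the host" */

/* Hypothetical userspace stand-in for a dentry; NOT the kernel's struct dentry. */
struct model_dentry {
    const char *name;
    struct model_dentry *children[4];
    size_t nr_children;
    unsigned int owner_id; /* the per-dentry tag */
};

/* Models tagging at pivot_root time: mark the new root and everything below it. */
void tag_subtree(struct model_dentry *root, unsigned int container_id)
{
    assert(root != NULL);
    root->owner_id = container_id;
    for (size_t i = 0; i < root->nr_children; i++)
        tag_subtree(root->children[i], container_id);
}

/*
 * Assumed example policy, evaluated where a path walk completes:
 *  - a task may reach dentries carrying its own tag;
 *  - a host task may reach a container dentry only if it explicitly
 *    declared that it intends to enter the container (explicit_entry);
 *  - everything else is denied.
 */
bool walk_allowed(unsigned int task_id, const struct model_dentry *target,
                  bool explicit_entry)
{
    assert(target != NULL);
    if (task_id == target->owner_id)
        return true;
    if (task_id == HOST_ID)
        return explicit_entry;
    return false;
}
```

Under this assumed policy, a host tool that follows a container symlink by accident ends its walk on a dentry with a mismatching tag and is refused, regardless of which third-party component performed the lookup.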

Would this be an appropriate way to enhance the isolation provided by the VFS?

Thanks!