Re: Unprivileged filesystem mounts

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 11 Mar 2025 12:01:48 +0100

On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> > People have stuff to get done.  If you disallow unprivileged filesystem
> > mounts, they will just use sudo (or equivalent) instead.
> 
> I am not advocating that we disallow mounting of untrusted devices.
> 
> > The problem is
> > not that users are mounting untrusted filesystems.  The problem is that
> > mounting untrusted filesystems is unsafe.
> 
> > Making untrusted filesystems safe to mount is the only solution that
> > lets users do what they actually need to do. That means either actually
> > fixing the filesystem code,
> 
> Yes, and the point I keep making is that we cannot provide that
> guarantee from the kernel for existing filesystems. We cannot detect
> all possible malicious tampering situations without cryptographically
> secure verification, and we can't generate full trust from nothing.
> 
> The typical desktop policy of "probe and automount any device that
> is plugged in" prevents the user from examining the device to
> determine if it contains what it is supposed to contain.  The user
> is not given any opportunity to decide if trust is warranted before
> the kernel filesystem parser running in ring 0 is exposed to the
> malicious image.
> 
> That's the fundamental policy problem we need to address: the user
> and/or admin is not in control of their own security because
> application developers and/or distro maintainers have decided they
> should not have a choice.
> 
> In this situation, the choice of what to do *must* fall to the user,
> but the argument for "filesystem corruption is a CVE-worthy bug" is
> that the choice has been taken away from the user. That's what I'm
> saying needs to change - the choice needs to be returned to the
> user...
> 
> > or running it in a sufficiently tight
> > sandbox that vulnerabilities in it are of too low importance to matter.
> > libguestfs+FUSE is the most obvious way to do this, but the performance
> > might not be enough for distros to turn it on.
> 
> Yes, I have advocated for that to be used for desktop mounts in the
> past. Similarly, I have also advocated for liblinux + FUSE to be
> used so that the kernel filesystem code is used but run from a
> userspace context where the kernel cannot be compromised.
> 
> I have also advocated for user removable devices to be encrypted by
> default. The act of the user unlocking the device automatically
> marks it as trusted because undetectable malicious tampering is
> highly unlikely.
> 
> I have also advocated for a device registry that records removable
> device signatures and whether the user trusted them or not so that
> they only need to be prompted once for any given removable device
> they use.
> 
> There are *many* potential user-friendly solutions to the problem,
> but they -all- lie in the domain of userspace applications and/or
> policies. This is *not* a problem more or better code in the kernel
> can solve.

Strongly agree.

> 
> Kees and Co keep telling us we should be making changes that make it
> harder (or completely prevent) entire classes of vulnerabilities
> from being exploited. Yet every time we suggest that a more secure
> policy should be applied to automounting filesystems to prevent
> system compromise on device hotplug, nobody seems to be willing to
> put security first.

I agree with Dave here a lot.

The case where arbitrary devices stuck into a laptop (e.g., USB sticks)
are mounted isn't solved by making a filesystem mountable unprivileged.
The mounted device cannot show up in the global mount namespace
somewhere since the user doesn't own the initial mount+user namespace.
So it's pointless. In other words, there are filesystem-level checks and
mount-namespace-based checks. Circumventing that restriction would mean
that any user could mount the device at any location in the global mount
namespace and therefore simply overmount other stuff.

The other thing is whether or not a filesystem is allowed to be mounted
from an unprivileged user namespace. That is not a policy decision the
kernel can make, should make, or has to make. This is a road to security
disaster.

The new mount API has built-in delegation capabilities for exactly this
reason and use case, so the kernel doesn't have to do that. Policy like
that belongs in userspace. The new mount API makes it possible for
userspace to correctly and safely delegate any filesystem mount to
unprivileged users. For example, it is heavily used by bpf to make bpffs,
and thus bpf, usable by unprivileged userspace and containers.

There's already a generic API for this, which we presented at LSFMM 2023
in [1]. It has proper security policies governing when and how even a
user who is not in a user namespace is allowed to mount an arbitrary
filesystem (device-based or not).

    NAME
    systemd-mountfsd.service, systemd-mountfsd - Disk Image File System Mount Service

    SYNOPSIS
    systemd-mountfsd.service

    /usr/lib/systemd/systemd-mountfsd

    DESCRIPTION
    systemd-mountfsd is a system service that dissects disk images, and
    returns mount file descriptors for the file systems contained therein to
    clients, via a Varlink IPC API.

    The disk images provided must contain a raw file system image or must
    follow the Discoverable Partitions Specification[1]. Before mounting any
    file systems authenticity of the disk image is established in one or a
    combination of the following ways:

    1. If the disk image is located in a regular file in one of the
       directories /var/lib/machines/, /var/lib/portables/,
       /var/lib/extensions/, /var/lib/confexts/ or their counterparts in the
       /etc/, /run/, /usr/lib/ it is assumed to be trusted.

    2. If the disk image contains a Verity enabled disk image, along with a
       signature partition with a key in the kernel keyring or in
       /etc/verity.d/ (and related directories) the disk image is considered
       trusted.

    This service provides one Varlink[2] service:
    io.systemd.MountFileSystem which accepts a file descriptor to a
    regular file or block device, and returns a number of file
    descriptors referring to an fsmount() file descriptor the client may
    then attach to a path of their choice.

    The returned mounts are automatically allowlisted in the
    per-user-namespace allowlist maintained by
    systemd-nsresourced.service(8).

    The file systems are automatically fsck(8)'ed before mounting.

    NOTES
    1. Discoverable Partitions Specification
       https://uapi-group.org/specifications/specs/discoverable_partitions_specification/

    2. Varlink
       https://varlink.org/
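
To make the userspace policy angle concrete, here is a rough Python
sketch of trust rule 1 from the excerpt above. It is not systemd's
implementation. The function names are hypothetical, the expansion of
"counterparts" to /etc, /run and /usr/lib is my reading of the text, and
it only accepts files directly inside a trusted directory. A real service
would check the already-opened file descriptor rather than a path string.

```python
from pathlib import PurePosixPath

# Directory names come from the man page excerpt; combining them with
# these prefixes is this sketch's interpretation of "counterparts".
IMAGE_DIR_NAMES = ("machines", "portables", "extensions", "confexts")
PREFIXES = ("/var/lib", "/etc", "/run", "/usr/lib")
TRUSTED_DIRS = tuple(
    PurePosixPath(prefix) / name
    for prefix in PREFIXES
    for name in IMAGE_DIR_NAMES
)


def is_trusted_location(image_path: str) -> bool:
    """Rule 1: the image sits directly inside a trusted directory."""
    path = PurePosixPath(image_path)
    # Reject relative paths and ".." components instead of resolving them.
    if not path.is_absolute() or ".." in path.parts:
        return False
    return path.parent in TRUSTED_DIRS


def image_is_trusted(image_path: str, has_valid_verity_signature: bool) -> bool:
    """Either rule is sufficient; rule 2 (Verity) is reduced to a flag here."""
    return is_trusted_location(image_path) or has_valid_verity_signature
```

The point is that a decision like this is easy to express, audit and
change in a userspace service, which is why it doesn't belong in the
kernel.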

This work has now also been expanded to cover plain directory trees and
will be available in the next release.
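
For illustration, the handoff of a mount file descriptor and the
client-side attach could look roughly like the Python sketch below. The
assumptions are mine: the real service speaks Varlink rather than a bare
socket, the helper names are hypothetical, and 429 is the move_mount(2)
syscall number on x86_64, arm64 and most other architectures. Treat it as
a sketch of the mechanism, not as the actual client code.

```python
import ctypes
import os
import socket

# Assumed values: syscall number for move_mount(2) on x86_64/arm64, and
# constants mirroring <fcntl.h> and <linux/mount.h>.
SYS_MOVE_MOUNT = 429
AT_FDCWD = -100
MOVE_MOUNT_F_EMPTY_PATH = 0x00000004


def send_mount_fd(sock, mount_fd):
    """Service side (hypothetical): pass a mount fd via SCM_RIGHTS."""
    socket.send_fds(sock, [b"mount"], [mount_fd])


def recv_mount_fd(sock):
    """Client side: receive exactly one file descriptor."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 64, 1)
    if len(fds) != 1:
        raise RuntimeError("expected exactly one file descriptor")
    return fds[0]


def attach_mount(mount_fd, target):
    """Attach a detached mount, as returned by fsmount(), at target."""
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(
        ctypes.c_long(SYS_MOVE_MOUNT),
        ctypes.c_long(mount_fd),
        ctypes.c_char_p(b""),  # empty path: operate on mount_fd itself
        ctypes.c_long(AT_FDCWD),
        ctypes.c_char_p(os.fsencode(target)),
        ctypes.c_long(MOVE_MOUNT_F_EMPTY_PATH),
    )
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```

The client never parses the image or calls mount(2) on a device itself.
It only receives an fd for a mount that the service has already vetted,
and the kernel still applies its usual permission checks when the client
attaches it.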

It is currently part of systemd, but like a lot of other such tools it
is either available standalone for non-systemd systems or, if not, can
be made so.

[1]: https://youtu.be/RbMhupT3Dk4?si=pIGH5XPPUJ0m6bi0