Re: Unprivileged filesystem mounts

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 18 Mar 2025 16:21:48 +1100

On Tue, Mar 11, 2025 at 04:10:42PM -0400, Demi Marie Obenour wrote:
> On Tue, Mar 11, 2025 at 04:57:54PM +1100, Dave Chinner wrote:
> > On Mon, Mar 10, 2025 at 10:19:57PM -0400, Demi Marie Obenour wrote:
> > > People have stuff to get done.  If you disallow unprivileged filesystem
> > > mounts, they will just use sudo (or equivalent) instead.
> > 
> > I am not advocating that we disallow mounting of untrusted devices.
> > 
> > > The problem is
> > > not that users are mounting untrusted filesystems.  The problem is that
> > > mounting untrusted filesystems is unsafe.
> > 
> > > Making untrusted filesystems safe to mount is the only solution that
> > > lets users do what they actually need to do. That means either actually
> > > fixing the filesystem code,
> > 
> > Yes, and the point I keep making is that we cannot provide that
> > guarantee from the kernel for existing filesystems. We cannot detect
> > all possible malicious tampering situations without cryptographically
> > secure verification, and we can't generate full trust from nothing.
> 
> Why is it not possible to provide that guarantee?  I'm not concerned
> about infinite loops or deadlocks.  Is there a reason it is not possible
> to prevent memory corruption?

You're asking me to prove that the on-disk filesystem format parsing
implementation is 100% provably correct. Not only that, you're
wanting me to say that journal replay copying incomplete,
unverifiable structure fragments over the top of existing disk
structures is 100% provably correct.

I am the person who architected the existing metadata validation
infrastructure that XFS uses, and so I know its limitations in
intimate detail. It is, by far, the closest thing we have to
complete runtime metadata validation in any Linux filesystem
(except maybe bcachefs), but it is nowhere near able to detect and
prevent 100% of potential structure corruptions.

It is *far from trivial* to validate all the weird corner cases that
exist in the on-disk format that have evolved over the last 3
decades. For the first 15 years of development, almost zero thought
was given to runtime validation of the on-disk format. People even
fought against introducing it at all. And despite this, we still
have to support the on-disk functionality that those old,
difficult-to-validate persistent structures describe.

[ And then there's some other random memory corruption bug in the
code, and all bets are off... ]

IOWs, no filesystem developer is ever going to give you a guarantee
that a filesystem implementation is free from memory corruption bugs
unless it was designed and implemented from the ground up to be
100% safe from such issues. No such filesystem exists in the kernel,
and it will probably be years before anything exists to fill that
gap.

> > The typical desktop policy of "probe and automount any device that
> > is plugged in" prevents the user from examining the device to
> > determine if it contains what it is supposed to contain.  The user
> > is not given any opportunity to decide if trust is warranted before
> > the kernel filesystem parser running in ring 0 is exposed to the
> > malicious image.
> > 
> > That's the fundamental policy problem we need to address: the user
> > and/or admin is not in control of their own security because
> > application developers and/or distro maintainers have decided they
> > should not have a choice.
> > 
> > In this situation, the choice of what to do *must* fall to the user,
> > but the argument for "filesystem corruption is a CVE-worthy bug" is
> > that the choice has been taken away from the user. That's what I'm
> > saying needs to change - the choice needs to be returned to the
> > user...
> 
> I am 100% in favor of not automounting filesystems without user
> interaction, but that only means that an exploit will require user
> interaction.  Users need to get things done, and if their task requires
> them to mount a not-fully-trusted filesystem image, then that is what they
> will do, and they will typically do it in the most obvious way possible.
> That most obvious way needs to be a safe way, and it needs to have good
> enough performance that users don't go around looking for an unsafe way.

Well, yes, that is obvious, and not a point of contention at all,
as is evidenced by the list of solutions to this problem I outlined.

> > > or running it in a sufficiently tight
> > > sandbox that vulnerabilities in it are not important enough to matter.
> > > libguestfs+FUSE is the most obvious way to do this, but the performance
> > > might not be enough for distros to turn it on.
> > 
> > Yes, I have advocated for that to be used for desktop mounts in the
> > past. Similarly, I have also advocated for liblinux + FUSE to be
> > used so that the kernel filesystem code is used but run from a
> > userspace context where the kernel cannot be compromised.
> > 
> > I have also advocated for user removable devices to be encrypted by
> > default. The act of the user unlocking the device automatically
> > marks it as trusted because undetectable malicious tampering is
> > highly unlikely.
> 
> That is definitely a good idea.
> 
> > I have also advocated for a device registry that records removable
> > device signatures and whether the user trusted them or not so that
> > they only need to be prompted once for any given removable device
> > they use.
> > 
> > There are *many* potential user-friendly solutions to the problem,
> > but they -all- lie in the domain of userspace applications and/or
> > policies. This is *not* a problem more or better code in the kernel
> > can solve.
> 
> It is certainly possible to make a memory safe implementation of any
> filesystem.

Spoken like a True Expert.

> If the current implementation can't prevent memory
> corruption if a malicious filesystem is mounted, that is a
> characteristic of the implementation.

Ah, now I see what you are trying to do. You're building a strawman
around memory corruption that you can knock down with the argument
"we need to reimplement everything in Rust".

Sorry, not playing that game.

> However, the root filesystem is not the only filesystem image that must
> be mounted.  There is also a writable data volume, and that _cannot_ be
> signed because it contains user data.  It is encrypted, but part of the
> threat model for both Android and ChromeOS is an attacker who has gained
> root or even kernel code execution and wants to retain their access
> across device reboots. They can't tamper with the kernel or root
> filesystem, and privileged userspace treats the data on the writable
> filesystem as untrusted.  However, the attacker can replace the writable
> filesystem image with anything they want,

And therein lies the attack a filesystem implementation can't defend
against: the attacker can rewrite the unencrypted block device to
contain anything they want, and that will then pass verification on
the next boot. Perhaps that's the class of storage attack you should
seek to prevent, rather than trying to slap bandaids over trust model
violations or insinuating that the only solution is to rewrite complex
subsystems in Rust....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx