Re: [PATCH 02/10] vfs: syscall: Add move_mount(2) to move mounts around

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 8 Jul 2019 19:01:32 +0100

On Mon, Jul 08, 2019 at 12:12:21PM -0500, Eric W. Biederman wrote:

> Al you do realize that the TOCTOU you are talking about comes the system
> call API.  TOMOYO can only be faulted for not playing in their own
> sandbox and not reaching out and fixing the vfs implementation details.
>
> Userspace has always had to very careful to only mount filesystems
> on paths that root completely controls and won't change.

That has nothing whatsoever to do with the path where you are mounting
something.  _That_ is actually looked up before ->sb_mount() gets called;
no TOCTOU there.

The thing where ->sb_mount() is fucked by design is its handling of
	* device name
	* old tree in mount --bind
	* old tree in mount --move
	* things like journal name (not that any of the instances had tried
to do anything with that)

All of those *do* have TOCTOU, and that's an inevitable result of the
idiotic hook fetishism of LSM design.  Instead of "we want something
to happen when such-and-such predicate is about to change", it's
"lemme run my code, the earlier the better, I don't care about any
damn predicates, it's all too complicated anyway, whaddya mean
racy?"

Any time you have pathname resolution done twice, it's a built-in race.
If you want *ALL* checks on mount(2) to be done before the mean, nasty
kernel code gets to decide anything (bind/move/mount/etc. all squashed
together, just let us have at the syscall arguments, mmkay?) - that's
precisely what you get.

And no, that TOCTOU is not in syscall API.  "open() of an untrusted
pathname may end up trying to open hell knows what" is one thing;
"open() of an untrusted pathname may apply MAC checks to one object
and open something entirely different" is another.  The former is
inherent to syscall API.  The latter would be a badly fucked up
implementation (we don't have that issue on open(2), thankfully).

To make it clear, TOMOYO is not at fault here; LSM "architecture" is.
Note, BTW, that TOMOYO checks there do *NOT* limit the input pathname
at all - only the destination of the first pathwalk.  Repeating it
may easily lead to an entirely different place.

Canonicalized pathname is derived from pathwalk result; having concluded
that it's perfectly fine for the operation requested is pure security
theatre - it
	* says nothing about the trustedness of the original pathname
	* may have nothing whatsoever to the object yielded by the
second pathwalk, which is what'll end up actually used.
It's not even "this thing walks through /proc, and thus not to be trusted
to be stable" - the checks won't notice where the damn thing had been.

When somebody proposes _useful_ MAC for mount --move (and that really
can't be done at the level of syscall entry - we need to have already
figured out that with given combination of flags the 1st argument of
mount(2) will be a pathname *and* already looked it up), sure - it
will be added to do_move_mount(), which is where we have all lookups
done, and apply both for mount() and move_mount().

Right now anyone relying upon DAC enforced for MS_MOVE has worse problems
than "attacker will use move_mount(2) and bypass my policy" - the same
attacker can bloody well bypass those with nothing more exotic than
clone(2) and dup2(2) (and mount(2), of course).

And it's not just MS_MOVE (or MS_BIND).  Anyone trying to prevent
mounting e.g. ext2 from untrusted device and do that on the level of
->sb_mount() *is* *bloody* *well* *fucked*.  ->sb_mount() is simply
the wrong place for that.