Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

WANG Xuerui <kernel@xxxxxxxxxx> · Tue, 27 Feb 2024 01:38:52 +0800

On 2/26/24 23:35, Christian Brauner wrote:
On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote:
On 2/26/24 21:32, Christian Brauner wrote:
On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
On Sun, 2024-02-25 at 15:32 +0800, Xi Ruoyao wrote:
On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
My idea is this problem needs syscalls to be designed with deep
argument inspection in mind; syscalls before this should be
considered
as historical error and get fixed by restoring old syscalls.
I'd not consider fstat an error as using statx for fstat has a
performance impact (severe for some workflows), and Linus has
concluded
Sorry, to clarify: I mean statx is an error in ABI design, not fstat.
I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
"AT_NULL_PATH"/nullptr in the first place?
Not sure, but it's hard to change now since the libc
implementation won't easily know whether using the NULL
path is safe on a given kernel. It could check the kernel
version number, but that adds another bit of complexity in
the fast path and doesn't work on old kernels with the
feature backported.

But it's not irrational to pass a path to syscall, as long as we still
have the concept of file system (maybe in 2371 or some year we'll use a
128-bit UUID instead of path).

The problem I see with the 'use fstat' approach is that this
does not work on 32-bit architectures, unless we define a new
fstatat64_time64() syscall, which is one of the things that statx()
"fstat64_time64".  Using statx for fstatat should be just fine.
Right. It does feel wrong to have only an fstat() variant but not
fstatat() if we go there.

Or maybe we can just introduce a new AT_something to make statx
completely ignore pathname but behave like AT_EMPTY_PATH + "".
I think this is better than going back to fstat64_time64(), but
it's still not great because

- all the reserved flags on statx() are by definition incompatible
    with existing kernels that return -EINVAL for any flag they do
    not recognize.

- you still need to convince libc developers to actually use
    the flag despite the backwards compatibility problem, either
    with a fallback to the current behavior or a version check.

Using the NULL path as a fallback would solve the problem with
seccomp, but it would not make the normal case any faster.
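
A minimal sketch of the "try NULL, fall back to the empty string" idea
discussed above, assuming (this is an assumption, not something stated in
the thread) that a kernel without NULL-path support rejects the NULL
pointer with EFAULT. StatxFn, NullPath and StatxFdOnly are hypothetical
names; StatxFn stands in for the raw statx syscall so that the logic can
be exercised without a particular kernel.

```cpp
#include <fcntl.h>     // AT_EMPTY_PATH
#include <sys/stat.h>  // struct statx, STATX_BASIC_STATS (glibc >= 2.28)
#include <cerrno>

// Hypothetical hook with statx(2)'s argument list; returns 0 or -errno.
using StatxFn = int (*)(int dirfd, const char* path, int flags,
                        unsigned mask, struct statx* buf);

// What the caller has learned so far about NULL-path support.
enum class NullPath { kUnknown, kSupported, kUnsupported };

// fd-only stat: prefer the NULL path (which a seccomp filter could verify
// as args[1] == 0) and fall back to "" when the kernel rejects it.
// Assumption: an old kernel fails the NULL path with -EFAULT. A bad buf
// would also produce -EFAULT; a real implementation must tell them apart.
int StatxFdOnly(StatxFn do_statx, NullPath* cache, int fd,
                struct statx* buf) {
  if (*cache != NullPath::kUnsupported) {
    int rc = do_statx(fd, nullptr, AT_EMPTY_PATH, STATX_BASIC_STATS, buf);
    if (rc != -EFAULT) {
      if (rc == 0) *cache = NullPath::kSupported;
      return rc;
    }
    *cache = NullPath::kUnsupported;  // remember, so the probe is paid once
  }
  return do_statx(fd, "", AT_EMPTY_PATH, STATX_BASIC_STATS, buf);
}
```

On a kernel without NULL-path support the first call costs two syscalls
and later calls one, which is the extra fast-path complexity mentioned
above. Once the fallback is taken, the path is again a non-NULL pointer
that a filter cannot inspect.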

was trying to avoid.
Oops.  I thought "newstat" should be using 64-bit time but it seems the
"new" is not what I'd expected...  The "new" actually means "newer than
Linux 0.9"! :(

Let's not use "new" in future syscall names...
Right, we definitely can't ever succeed. On some architectures
we even had "oldstat" and "stat" before "newstat" and "stat64",
and on some architectures we mix them up. E.g. x86_64 has fstat()
and fstatat64() with the same structure but doesn't define
__NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
timestamps unlike all other 64-bit architectures.

statx() was intended to solve these problems once and for all,
and it appears that we have failed again.
New apis don't invalidate old apis necessarily. That's just not going to
work in an age where you have containerized workloads.

statx() is just the beginning of this. A container may have arbitrary
seccomp profiles that return ENOSYS or even EPERM for whatever reason
for any new api that exists. So not implementing fstat() might already
break container workloads.

Another example: You can't just skip on implementing mount() and only
implement the new mount api for example. Because tools that look for api
simplicity and don't need complex setup will _always_ continue to use
mount() and have a right to do so.

And fwiw, mount() isn't fully inspectable by seccomp since forever. The
list goes on and on.

But let's look at the original mail. Why are they denying statx() and
what's that claim about it not being able to be rewritten to something
safe? Looking at:

intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
                                void* fs_denied_errno) {
    if (args.nr == __NR_fstatat_default) {
      if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
          args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
        return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
                       reinterpret_cast<default_stat_struct*>(args.args[2]));
      }
      return -reinterpret_cast<intptr_t>(fs_denied_errno);
    }

What this does is rewrite fstatat() to fstat() if it was made with
AT_EMPTY_PATH and the path argument was "". That is easily doable for
statx() because it has the exact same AT_EMPTY_PATH semantics that
fstatat() has.

Plus, they can even filter on mask and rewrite that to something that
they think is safe. For example, STATX_BASIC_STATS which is equivalent
to what any fstat() call returns. So it's pretty difficult to understand
what their actual gripe with statx() is.

It can't be that statx() passes a struct because fstatat() and fstat()
do that too. So what exactly is that problem there?
From our investigation:

For (new)fstatat calls that the sandboxed process may make, this SIGSYS
handler either:

* turns allowed calls (those looking at fd's) into fstat's that only have
one argument (the fd) each, or
* denies the call,
Yes, but look at the filtering that they do:

if (args.nr == __NR_fstatat_default) {
	if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
	    args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {

So if you have a statx() call instead of an fstatat() call this is
trivially:

if (args.nr == __NR_statx) {
	if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
	    args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) {

maybe if they care about it also simply check
args.args[3] == STATX_BASIC_STATS.

And then just as with fstatat() rewrite it to fstat().
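
The check described above can be sketched as a standalone predicate. This
is an illustration only, not Chromium's code: TrappedSyscall is a
hypothetical stand-in for arch_seccomp_data, and the exact-equality test
on the flags simply mirrors the quoted fstatat() handler. It relies on
the SIGSYS handler running inside the trapped process, so it may
dereference the path pointer.

```cpp
#include <fcntl.h>        // AT_EMPTY_PATH
#include <sys/syscall.h>  // __NR_statx
#include <cstdint>

// Hypothetical stand-in for arch_seccomp_data: the syscall number plus
// the six raw argument registers.
struct TrappedSyscall {
  int nr;
  uint64_t args[6];
};

// statx(dirfd, pathname, flags, mask, statxbuf): returns true only for
// the fd-only form, i.e. pathname is "" and flags is exactly
// AT_EMPTY_PATH.
bool IsFdOnlyStatx(const TrappedSyscall& call) {
  if (call.nr != __NR_statx) return false;
  const char* path =
      reinterpret_cast<const char*>(static_cast<uintptr_t>(call.args[1]));
  if (path == nullptr || *path != '\0') return false;
  return call.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH);
}
```

A handler would still have to decide what to do once the predicate
returns true, which is where the availability of an fstat() syscall
matters.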

But fstat() and fstatat() share the same return value type i.e. struct 
stat, different from struct statx. And they are different enough that 
their existing seccomp policy can distinguish. In the statx-only case 
though, the seccomp policy cannot distinguish "statx actually called 
with empty path" from "statx called with AT_EMPTY_PATH but non-empty 
path" because in both cases the path would be a non-NULL pointer opaque 
to the policy cBPF program.
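
To illustrate the limitation, here is a plain C++ model (not an actual
cBPF program) of what a filter can compute. It receives only struct
seccomp_data, i.e. the syscall number and raw register values, so the
best it can do for statx is look at the flags and hand the call to the
SIGSYS handler. ModelFilter and its policy are hypothetical.

```cpp
#include <fcntl.h>          // AT_EMPTY_PATH
#include <linux/seccomp.h>  // struct seccomp_data, SECCOMP_RET_*
#include <sys/syscall.h>    // __NR_statx
#include <cerrno>
#include <cstdint>

// Models a filter decision. args[1] is the pathname pointer, but a filter
// cannot read user memory, so the string behind it never enters the
// decision.
uint32_t ModelFilter(const seccomp_data& d) {
  if (d.nr != __NR_statx) return SECCOMP_RET_ALLOW;
  if (d.args[2] != static_cast<uint64_t>(AT_EMPTY_PATH)) {
    // Not even nominally fd-only: deny outright.
    return SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA);
  }
  // Might be fd-only, might carry a real path: only code running in the
  // process (the SIGSYS handler) can tell, so trap.
  return SECCOMP_RET_TRAP;
}
```

Two calls that differ only in the string behind args[1], for instance ""
and "/etc/shadow", get the same verdict from this function.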

so the sandbox only ever sees fstat calls and no (new)fstatat's, and the
guarantee that only open fds can ever be stat'ed trivially holds.

With statx, however, there's no way of guaranteeing "only look at fd"
semantics without peeking into the path argument, because a non-empty path
makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to
use making it near-impossible to introduce new semantics in a
backwards-compatible manner.
I don't understand. That's exactly the same thing as for fstatat(). My
point is that you can turn statx() into fstat() just like you can turn
fstatat() into fstat(). So if you add fstat()/fstat64() what's left to
do?
Yes, once fstat is restored it's a matter of transforming every allowed 
statx into fstat, then translating struct stat back into struct statx. 
What we're seeking is a possible way forward without re-introducing that 
though, because we still have some time and don't have to rush.
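
For reference, the "translating struct stat back into struct statx" step
could look roughly like the sketch below. StatToStatx is a hypothetical
helper; it fills only the STATX_BASIC_STATS fields and leaves everything
fstat() cannot provide (birth time, attributes) zeroed.

```cpp
#include <sys/stat.h>       // struct stat, struct statx (glibc >= 2.28)
#include <sys/sysmacros.h>  // major(), minor()
#include <cstring>

// Copies the fields of an fstat() result into the statx layout.
void StatToStatx(const struct stat& in, struct statx* out) {
  std::memset(out, 0, sizeof(*out));
  out->stx_mask = STATX_BASIC_STATS;  // no STATX_BTIME: fstat lacks it
  out->stx_blksize = in.st_blksize;
  out->stx_nlink = in.st_nlink;
  out->stx_uid = in.st_uid;
  out->stx_gid = in.st_gid;
  out->stx_mode = in.st_mode;
  out->stx_ino = in.st_ino;
  out->stx_size = in.st_size;
  out->stx_blocks = in.st_blocks;
  out->stx_atime.tv_sec = in.st_atim.tv_sec;
  out->stx_atime.tv_nsec = in.st_atim.tv_nsec;
  out->stx_mtime.tv_sec = in.st_mtim.tv_sec;
  out->stx_mtime.tv_nsec = in.st_mtim.tv_nsec;
  out->stx_ctime.tv_sec = in.st_ctim.tv_sec;
  out->stx_ctime.tv_nsec = in.st_ctim.tv_nsec;
  // statx splits device numbers into major/minor halves.
  out->stx_dev_major = major(in.st_dev);
  out->stx_dev_minor = minor(in.st_dev);
  out->stx_rdev_major = major(in.st_rdev);
  out->stx_rdev_minor = minor(in.st_rdev);
}
```

A caller that asked for more than STATX_BASIC_STATS can see from the
returned stx_mask that the extra fields were not filled in.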
What this tells me without knowing the exact reason is that they thought
"Oh, if we just return ENOSYS then the workload or glibc will just
always be able to fallback to fstat() or fstatat()". Which ultimately is
the exact same thing that containers often assume.

So really, just skipping on various system calls isn't going to work.
You can't just implement new system calls and forget about the rest
unless you know exactly what workloads your architecture will run on.

Please implement fstat() or fstatat() and stop inventing hacks for
statx() to make weird sandboxing rules work, please.
We have already provided fstat(at) on LoongArch for a while by
unconditionally doing statx and translating the returned structure -- see
the [glibc] and [golang] [golang-2] implementations for example -- without
But you're doing that translation in userspace. I was talking about
adding the fstat()/fstat64() system calls.
Hmm, yeah. I meant to provide some more context but I later realized 
that the sandbox is in charge of rewriting the syscalls from inside the 
sandbox, so the userland may not matter in the big picture after all.

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/