Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

Christian Brauner <brauner@xxxxxxxxxx> · Mon, 26 Feb 2024 16:35:30 +0100

On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote:
> On 2/26/24 21:32, Christian Brauner wrote:
> > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
> > > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
> > > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
> > > > > On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
> > > > > > 在 2024-02-25星期日的 15:32 +0800，Xi Ruoyao写道：
> > > > > > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > > > > > > > My idea is this problem needs syscalls to be designed with deep
> > > > > > > > argument inspection in mind; syscalls before this should be
> > > > > > > > considered
> > > > > > > > as historical error and get fixed by resotring old syscalls.
> > > > > > > 
> > > > > > > I'd not consider fstat an error as using statx for fstat has a
> > > > > > > performance impact (severe for some workflows), and Linus has
> > > > > > > concluded
> > > > > > 
> > > > > > Sorry for clearance, I mean statx is an error in ABI design, not fstat.
> > > > 
> > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
> > > > "AT_NULL_PATH"/nullptr in the first place?
> > > 
> > > Not sure, but it's hard to change now since the libc
> > > implementation won't easily know whether using the NULL
> > > path is safe on a given kernel. It could check the kernel
> > > version number, but that adds another bit of complexity in
> > > the fast path and doesn't work on old kernels with the
> > > feature backported.
> > > 
> > > > But it's not irrational to pass a path to syscall, as long as we still
> > > > have the concept of file system (maybe in 2371 or some year we'll use a
> > > > 128-bit UUID instead of path).
> > > > 
> > > > > The problem I see with the 'use use fstat' approach is that this
> > > > > does not work on 32-bit architectures, unless we define a new
> > > > > fstatat64_time64() syscall, which is one of the things that statx()
> > > > 
> > > > "fstat64_time64".  Using statx for fstatat should be just fine.
> > > 
> > > Right. It does feel wrong to have only an fstat() variant but not
> > > fstatat() if we go there.
> > > 
> > > > Or maybe we can just introduce a new AT_something to make statx
> > > > completely ignore pathname but behave like AT_EMPTY_PATH + "".
> > > 
> > > I think this is better than going back to fstat64_time64(), but
> > > it's still not great because
> > > 
> > > - all the reserved flags on statx() are by definition incompatible
> > >    with existing kernels that return -EINVAL for any flag they do
> > >    not recognize.
> > > 
> > > - you still need to convince libc developers to actually use
> > >    the flag despite the backwards compatibility problem, either
> > >    with a fallback to the current behavior or a version check.
> > > 
> > > Using the NULL path as a fallback would solve the problem with
> > > seccomp, but it would not make the normal case any faster.
> > > 
> > > > > was trying to avoid.
> > > > 
> > > > Oops.  I thought "newstat" should be using 64-bit time but it seems the
> > > > "new" is not what I'd expected...  The "new" actually means "newer than
> > > > Linux 0.9"! :(
> > > > 
> > > > Let's not use "new" in future syscall names...
> > > 
> > > Right, we definitely can't ever succeed. On some architectures
> > > we even had "oldstat" and "stat" before "newstat" and "stat64",
> > > and on some architectures we mix them up. E.g. x86_64 has fstat()
> > > and fstatat64() with the same structure but doesn't define
> > > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
> > > timestamps unlike all other 64-bit architectures.
> > > 
> > > statx() was intended to solve these problems once and for all,
> > > and it appears that we have failed again.
> > 
> > New apis don't invalidate old apis necessarily. That's just not going to
> > work in an age where you have containerized workloads.
> > 
> > statx() is just the beginning of this. A container may have aritrary
> > seccomp profiles that return ENOSYS or even EPERM for whatever reason
> > for any new api that exists. So not implementing fstat() might already
> > break container workloads.
> > 
> > Another example: You can't just skip on implementing mount() and only
> > implement the new mount api for example. Because tools that look for api
> > simplicity and don't need complex setup will _always_ continue to use
> > mount() and have a right to do so.
> > 
> > And fwiw, mount() isn't fully inspectable by seccomp since forever. The
> > list goes on and on.
> > 
> > But let's look at the original mail. Why are they denying statx() and
> > what's that claim about it not being able to be rewritten to something
> > safe? Looking at:
> > 
> > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
> >                                void* fs_denied_errno) {
> >    if (args.nr == __NR_fstatat_default) {
> >      if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
> >          args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
> >        return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
> >                       reinterpret_cast<default_stat_struct*>(args.args[2]));
> >      }
> >      return -reinterpret_cast<intptr_t>(fs_denied_errno);
> >    }
> > 
> > What this does it to rewrite fstatat() to fstat() if it was made with
> > AT_EMPTY_PATH and the path argument was "". That is easily doable for
> > statx() because it has the exact same AT_EMPTY_PATH semantics that
> > fstatat() has.
> > 
> > Plus, they can even filter on mask and rewrite that to something that
> > they think is safe. For example, STATX_BASIC_STATS which is equivalent
> > to what any fstat() call returns. So it's pretty difficult to understand
> > what their actual gripe with statx() is.
> > 
> > It can't be that statx() passes a struct because fstatat() and fstat()
> > do that too. So what exactly is that problem there?
> 
> From our investigation:
> 
> For (new)fstatat calls that the sandboxed process may make, this SIGSYS
> handler either:
> 
> * turns allowed calls (those looking at fd's) into fstat's that only have
> one argument (the fd) each, or
> * denies the call,

Yes, but look at the filtering that they do:

if (args.nr == __NR_fstatat_default) {
	if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
	    args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {

So if you have a statx() call instead of an fstatat() call this is
trivially:

if (args.nr == __NR_statx) {
	if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
	    args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) {

maybe if they care about it also simply check
args.args[3] == STATX_BASIC_STATS.

And then just as with fstatat() rewrite it to fstat().

> 
> so the sandbox only ever sees fstat calls and no (new)fstatat's, and the
> guarantee that only open fds can ever been stat'ed trivially holds.
> 
> With statx, however, there's no way of guaranteeing "only look at fd"
> semantics without peeking into the path argument, because a non-empty path
> makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to
> use making it near-impossible to introduce new semantics in a
> backwards-compatible manner.

I don't understand. That's exactly the same thing as for fstatat(). My
point is that you can turn statx() into fstat() just like you can turn
fstatat() into fstat(). So if you add fstat()/fstat64() what's left to
do?

> > What this tells me without knowing the exact reason is that they thought
> > "Oh, if we just return ENOSYS then the workload or glibc will just
> > always be able to fallback to fstat() or fstatat()". Which ultimately is
> > the exact same thing that containers often assume.
> > 
> > So really, just skipping on various system calls isn't going to work.
> > You can't just implement new system calls and forget about the rest
> > unless you know exactly what workloads your architecure will run on.
> > 
> > Please implement fstat() or fstatat() and stop inventing hacks for
> > statx() to make weird sandboxing rules work, please.
> 
> We have already provided fstat(at) on LoongArch for a while by
> unconditionally doing statx and translating the returned structure -- see
> the [glibc] and [golang] [golang-2] implementations for example -- without

But you're doing that translation in userspace. I was talking about
adding the fstat()/fstat64() system calls.