On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote: > On 2/26/24 21:32, Christian Brauner wrote: > > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: > > > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: > > > > > On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > > > > > > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > > > > > > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > > > > > > > My idea is this problem needs syscalls to be designed with deep > > > > > > > > argument inspection in mind; syscalls before this should be > > > > > > > > considered > > > > > > > > as historical error and get fixed by resotring old syscalls. > > > > > > > > > > > > > > I'd not consider fstat an error as using statx for fstat has a > > > > > > > performance impact (severe for some workflows), and Linus has > > > > > > > concluded > > > > > > > > > > > > Sorry for clearance, I mean statx is an error in ABI design, not fstat. > > > > > > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of > > > > "AT_NULL_PATH"/nullptr in the first place? > > > > > > Not sure, but it's hard to change now since the libc > > > implementation won't easily know whether using the NULL > > > path is safe on a given kernel. It could check the kernel > > > version number, but that adds another bit of complexity in > > > the fast path and doesn't work on old kernels with the > > > feature backported. > > > > > > > But it's not irrational to pass a path to syscall, as long as we still > > > > have the concept of file system (maybe in 2371 or some year we'll use a > > > > 128-bit UUID instead of path). > > > > > > > > > The problem I see with the 'use use fstat' approach is that this > > > > > does not work on 32-bit architectures, unless we define a new > > > > > fstatat64_time64() syscall, which is one of the things that statx() > > > > > > > > "fstat64_time64". Using statx for fstatat should be just fine. > > > > > > Right. It does feel wrong to have only an fstat() variant but not > > > fstatat() if we go there. > > > > > > > Or maybe we can just introduce a new AT_something to make statx > > > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > > > > > I think this is better than going back to fstat64_time64(), but > > > it's still not great because > > > > > > - all the reserved flags on statx() are by definition incompatible > > > with existing kernels that return -EINVAL for any flag they do > > > not recognize. > > > > > > - you still need to convince libc developers to actually use > > > the flag despite the backwards compatibility problem, either > > > with a fallback to the current behavior or a version check. > > > > > > Using the NULL path as a fallback would solve the problem with > > > seccomp, but it would not make the normal case any faster. > > > > > > > > was trying to avoid. > > > > > > > > Oops. I thought "newstat" should be using 64-bit time but it seems the > > > > "new" is not what I'd expected... The "new" actually means "newer than > > > > Linux 0.9"! :( > > > > > > > > Let's not use "new" in future syscall names... > > > > > > Right, we definitely can't ever succeed. On some architectures > > > we even had "oldstat" and "stat" before "newstat" and "stat64", > > > and on some architectures we mix them up. E.g. x86_64 has fstat() > > > and fstatat64() with the same structure but doesn't define > > > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit > > > timestamps unlike all other 64-bit architectures. > > > > > > statx() was intended to solve these problems once and for all, > > > and it appears that we have failed again. > > > > New apis don't invalidate old apis necessarily. That's just not going to > > work in an age where you have containerized workloads. > > > > statx() is just the beginning of this. A container may have aritrary > > seccomp profiles that return ENOSYS or even EPERM for whatever reason > > for any new api that exists. So not implementing fstat() might already > > break container workloads. > > > > Another example: You can't just skip on implementing mount() and only > > implement the new mount api for example. Because tools that look for api > > simplicity and don't need complex setup will _always_ continue to use > > mount() and have a right to do so. > > > > And fwiw, mount() isn't fully inspectable by seccomp since forever. The > > list goes on and on. > > > > But let's look at the original mail. Why are they denying statx() and > > what's that claim about it not being able to be rewritten to something > > safe? Looking at: > > > > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, > > void* fs_denied_errno) { > > if (args.nr == __NR_fstatat_default) { > > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && > > args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { > > return syscall(__NR_fstat_default, static_cast<int>(args.args[0]), > > reinterpret_cast<default_stat_struct*>(args.args[2])); > > } > > return -reinterpret_cast<intptr_t>(fs_denied_errno); > > } > > > > What this does it to rewrite fstatat() to fstat() if it was made with > > AT_EMPTY_PATH and the path argument was "". That is easily doable for > > statx() because it has the exact same AT_EMPTY_PATH semantics that > > fstatat() has. > > > > Plus, they can even filter on mask and rewrite that to something that > > they think is safe. For example, STATX_BASIC_STATS which is equivalent > > to what any fstat() call returns. So it's pretty difficult to understand > > what their actual gripe with statx() is. > > > > It can't be that statx() passes a struct because fstatat() and fstat() > > do that too. So what exactly is that problem there? > > From our investigation: > > For (new)fstatat calls that the sandboxed process may make, this SIGSYS > handler either: > > * turns allowed calls (those looking at fd's) into fstat's that only have > one argument (the fd) each, or > * denies the call, Yes, but look at the filtering that they do: if (args.nr == __NR_fstatat_default) { if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { So if you have a statx() call instead of an fstatat() call this is trivially: if (args.nr == __NR_statx) { if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) { maybe if they care about it also simply check args.args[3] == STATX_BASIC_STATS. And then just as with fstatat() rewrite it to fstat(). > > so the sandbox only ever sees fstat calls and no (new)fstatat's, and the > guarantee that only open fds can ever been stat'ed trivially holds. > > With statx, however, there's no way of guaranteeing "only look at fd" > semantics without peeking into the path argument, because a non-empty path > makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to > use making it near-impossible to introduce new semantics in a > backwards-compatible manner. I don't understand. That's exactly the same thing as for fstatat(). My point is that you can turn statx() into fstat() just like you can turn fstatat() into fstat(). So if you add fstat()/fstat64() what's left to do? > > What this tells me without knowing the exact reason is that they thought > > "Oh, if we just return ENOSYS then the workload or glibc will just > > always be able to fallback to fstat() or fstatat()". Which ultimately is > > the exact same thing that containers often assume. > > > > So really, just skipping on various system calls isn't going to work. > > You can't just implement new system calls and forget about the rest > > unless you know exactly what workloads your architecure will run on. > > > > Please implement fstat() or fstatat() and stop inventing hacks for > > statx() to make weird sandboxing rules work, please. > > We have already provided fstat(at) on LoongArch for a while by > unconditionally doing statx and translating the returned structure -- see > the [glibc] and [golang] [golang-2] implementations for example -- without But you're doing that translation in userspace. I was talking about adding the fstat()/fstat64() system calls.