On 2023-08-02, Jeff Xu <jeffxu@xxxxxxxxxxxx> wrote:
> > > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > > > > > because it will make it far too difficult to ever migrate. Instead it
> > > > > > should imply MFD_EXEC.
> > > > > >
> > > > > Though the purpose of memfd_noexec=2 is not to help with migration -
> > > > > but to disable creation of executable memfd for the current system/pid
> > > > > namespace.
> > > > > During the migration, vm.memfd_noexec=1 helps overwriting for
> > > > > unmigrated user code as a temporary measure.
> > > >
> > > > My point is that the current behaviour for =2 means that nobody other
> > > > than *maybe* ChromeOS will ever be able to use it because it requires
> > > > auditing every program on the system. In fact, it's possible even
> > > > ChromeOS will run into issues given that one of the arguments made for
> > > > the nosymfollow mount option was that auditing all of ChromeOS to
> > > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1]
> > > > (which I agreed with). Maybe this is less of an issue with
> > > > memfd_create(2) (which is much newer than open(2)) but it still seems
> > > > like a lot of busy work when the =1 behaviour is entirely sane even in
> > > > the strict threat model that =2 is trying to protect against.
> > > >
> > > It can also be a container (that has all memfd_create calls migrated to
> > > the new API)
> >
> > If ChromeOS would struggle to rewrite all of the libraries they use,
> > containers are in even worse shape -- most container users don't have a
> > complete list of every package installed in a container, let alone the
> > ability to audit whether they pass a (no-op) flag to memfd_create(2) in
> > every codepath.
> >
> > > One option I considered previously was "=2" would do overwrite+block,
> > > and "=3" just block.
> > > But then I worry that applications won't have
> > > motivation to ever change their existing code, the setting will
> > > forever stay at "=2", making "=3" even more impossible to ever be used
> > > system side.
> >
> > What is the downside of overwriting? Backwards-compatibility is a very
> > important part of Linux -- being able to use old programs without having
> > to modify them is incredibly important. Yes, this behaviour is opt-in --
> > but I don't see the point of making opting in more difficult than
> > necessary. Surely overwrite+block provides the security guarantee you
> > need from the threat model -- otherwise nobody will be able to use block
> > because you never know if one library will call memfd_create()
> > "incorrectly" without the new flags.
> > > >
> > > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL,
> > > > there are several tools for doing this (both seccomp and LSM hooks).
> > > >
> > > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@xxxxxxxxxx/
> > > >
> > > > > Additional functionality/features should be implemented through
> > > > > security hook and LSM, not sysctl, I think.
> > > >
> > > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you
> > > > could implement either =2 behaviour with an LSM using =1, and the
> > > > current strict =2 behaviour could be implemented purely with seccomp.)
> > > >
> > > By migration, I mean a system that is not fully migrated, such a
> > > system should just use "=0" or "=1". Additional features can be
> > > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if
> > > a system wants to limit executable memfd to specific programs or fully
> > > disable it.
> > > "=2" is for a system/container that is fully migrated, in that case,
> > > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient
> > > alternative.
> > > Yes, seccomp provides a similar mechanism.
> > > Indeed, combining "=1" and
> > > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is
> > > essentially what you want, iiuc. However, I do not wish to have this
> > > implemented in kernel, due to the thinking that I want kernel to get
> > > out of business of "overwriting" eventually.
> >
> > See my above comments -- "overwriting" is perfectly acceptable to me.
> > There's also no way to "get out of the business of overwriting" -- Linux
> > has strict backwards compatibility requirements.
> >
>
> I agree, if we weigh on the short term goal of letting the user space
> applications to do minimum, then having 4 state sysctl (or 2 sysctl,
> one controls overwrite, one disable/enable executable memfd) will do.
> But with that approach, I'm afraid a version of the future (say in 20
> years), most applications stays with memfd_create with the old API
> style, not setting the NX bit. With the current approach, it might seem
> to be less convenient, but I hope it offers a bit of incentive to make
> applications migrating their code towards the new API, explicitly
> setting the NX bit. I understand this hope is questionable, we might
> still end up the same in 20 years, but at least I tried :-). I will
> leave this decision to maintainers when you supply patches for that,
> and I wouldn't feel bad either way, there is a valid reason on both sides.

People will not switch =2 on if it has the possibility of breaking
existing programs that are doing nothing wrong by not passing a noop
flag. In 20 years at best you would have =1 in widespread use because
the rewriting behaviour is what users expect of kernel uAPIs. They
expect old programs to work without modifying them if they aren't doing
anything wrong.

A uAPI knob that requires every userspace program to change before you
can safely enable it (especially because it ratchets in a way that makes
it dangerous to enable on production machines) will simply not be used.
If the goal is to get programs to update (which it seems it is), having
a knob that nobody will turn on doesn't help. Doing proper warning
logging is the way to get userspace to switch -- userspace usually
notices when their programs trigger warnings in dmesg.

> To supplement, there are two other ways for what you want:
> 1> seccomp to block MFD_EXEC, and leaving the setting to 1.

I made this point in an earlier mail. However my point is that =2 is not
an acceptable uAPI and if you want something that looks like =2 you can
also implement that with seccomp too! In fact, the key difference is
that you cannot implement the rewriting easily in seccomp -- you would
need to install a seccomp_notify monitor that does nothing but rewrite
syscall arguments. This would be equivalent to running the entire system
under GDB to work around a uAPI flaw.

> 2> implement the blocking using a security hook and LSM, imo, which is
> probably the most common way to deal with this type of request (block
> something).

The issue is not the blocking, it's the rewriting.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
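
For readers following the thread, the behaviours being compared can be
sketched as a small Python model. This is only an illustration and not
kernel code: effective_flags(), the strict_2 switch and the Rejected
exception are hypothetical names made up for this sketch. The flag
values are the ones from include/uapi/linux/memfd.h. The "strict"
variant follows the description of the current =2 in this thread (refuse
any call that does not pass MFD_NOEXEC_SEAL), the other variant is the
overwrite+block behaviour argued for above. Which errno the kernel
returns is deliberately not modelled.

```python
"""Toy model of the vm.memfd_noexec behaviours discussed in this thread."""

# Flag values from include/uapi/linux/memfd.h.
MFD_NOEXEC_SEAL = 0x0008
MFD_EXEC = 0x0010


class Rejected(Exception):
    """The modelled memfd_create() call is refused (errno not modelled)."""


def effective_flags(flags: int, noexec: int, strict_2: bool) -> int:
    """Return the exec-related flags a memfd would end up with.

    noexec   -- value of the vm.memfd_noexec sysctl (0, 1 or 2)
    strict_2 -- True:  "=2" refuses every call lacking MFD_NOEXEC_SEAL
                False: "=2" rewrites old-style calls and only refuses
                       MFD_EXEC (hypothetical "overwrite+block")
    """
    if noexec not in (0, 1, 2):
        raise ValueError("vm.memfd_noexec must be 0, 1 or 2")
    if (flags & MFD_EXEC) and (flags & MFD_NOEXEC_SEAL):
        raise Rejected("MFD_EXEC and MFD_NOEXEC_SEAL are mutually exclusive")

    # An "old-style" caller passes neither of the two new flags.
    old_style = not (flags & (MFD_EXEC | MFD_NOEXEC_SEAL))

    if noexec == 2:
        if strict_2 and not (flags & MFD_NOEXEC_SEAL):
            raise Rejected("strict =2: MFD_NOEXEC_SEAL not passed")
        if flags & MFD_EXEC:
            raise Rejected("=2: executable memfds are disabled")

    if old_style:
        # The "overwriting": the sysctl picks the default for the caller.
        return flags | (MFD_EXEC if noexec == 0 else MFD_NOEXEC_SEAL)
    return flags


if __name__ == "__main__":
    # An unmodified program calling memfd_create() without the new flags.
    for strict in (True, False):
        try:
            outcome = hex(effective_flags(0, noexec=2, strict_2=strict))
        except Rejected as exc:
            outcome = f"rejected ({exc})"
        print(f"old-style call, =2, strict_2={strict}: {outcome}")
```

In this model the only difference between the two variants is what
happens to an unmodified program under =2: the strict variant refuses
it, the overwrite+block variant hands it a non-executable memfd. An
explicit MFD_EXEC is refused under =2 either way.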