> > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > > > >   because it will make it far too difficult to ever migrate. Instead it
> > > > >   should imply MFD_EXEC.
> > > > >
> > > > Though the purpose of memfd_noexec=2 is not to help with migration -
> > > > but to disable creation of executable memfd for the current system/pid
> > > > namespace.
> > > > During the migration, vm.memfd_noexec=1 helps overwriting for
> > > > unmigrated user code as a temporary measure.
> > > >
> > > My point is that the current behaviour for =2 means that nobody other
> > > than *maybe* ChromeOS will ever be able to use it because it requires
> > > auditing every program on the system. In fact, it's possible even
> > > ChromeOS will run into issues given that one of the arguments made for
> > > the nosymfollow mount option was that auditing all of ChromeOS to
> > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1]
> > > (which I agreed with). Maybe this is less of an issue with
> > > memfd_create(2) (which is much newer than open(2)) but it still seems
> > > like a lot of busy work when the =1 behaviour is entirely sane even in
> > > the strict threat model that =2 is trying to protect against.
> > >
> > It can also be a container (that has all memfd_create migrated to new API)
>
> If ChromeOS would struggle to rewrite all of the libraries they use,
> containers are in even worse shape -- most container users don't have a
> complete list of every package installed in a container, let alone the
> ability to audit whether they pass a (no-op) flag to memfd_create(2) in
> every codepath.
>
> > One option I considered previously was "=2" would do overwrite+block,
> > and "=3" just block. But then I worry that applications won't have
> > motivation to ever change their existing code, the setting will
> > forever stay at "=2", making "=3" even more impossible to ever be used
> > system wide.
>
> What is the downside of overwriting?
> Backwards-compatibility is a very
> important part of Linux -- being able to use old programs without having
> to modify them is incredibly important. Yes, this behaviour is opt-in --
> but I don't see the point of making opting in more difficult than
> necessary. Surely overwrite+block provides the security guarantee you
> need from the threat model -- otherwise nobody will be able to use block
> because you never know if one library will call memfd_create()
> "incorrectly" without the new flags.
>
> > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL,
> > > there are several tools for doing this (both seccomp and LSM hooks).
> > >
> > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@xxxxxxxxxx/
> > >
> > > > Additional functionality/features should be implemented through
> > > > security hook and LSM, not sysctl, I think.
> > >
> > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you
> > > could implement either =2 behaviour with an LSM using =1, and the
> > > current strict =2 behaviour could be implemented purely with seccomp.)
> > >
> > By migration, I mean a system that is not fully migrated, such a
> > system should just use "=0" or "=1". Additional features can be
> > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if
> > a system wants to limit executable memfd to specific programs or fully
> > disable it.
> > "=2" is for a system/container that is fully migrated, in that case,
> > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient
> > alternative.
> > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and
> > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is
> > essentially what you want, iiuc. However, I do not wish to have this
> > implemented in kernel, due to the thinking that I want kernel to get
> > out of business of "overwriting" eventually.
>
> See my above comments -- "overwriting" is perfectly acceptable to me.
> There's also no way to "get out of the business of overwriting" -- Linux
> has strict backwards compatibility requirements.
>
I agree: if we weigh the short-term goal of requiring the least work from
user space applications, then a four-state sysctl (or two sysctls, one
controlling overwrite and one enabling/disabling executable memfd) will do.

But with that approach, I'm afraid that in the future (say in 20 years)
most applications will still call memfd_create with the old API style, not
setting the NX bit. The current approach might seem less convenient, but I
hope it offers a bit of incentive for applications to migrate their code to
the new API and set the NX bit explicitly. I understand this hope is
questionable, and we might still end up in the same place in 20 years, but
at least I tried :-).

I will leave this decision to the maintainers when you supply patches for
that, and I wouldn't feel bad either way; there are valid reasons on both
sides.

To supplement, there are two other ways to get what you want:
1> Use seccomp to block MFD_EXEC, and leave the setting at 1.
2> Implement the blocking using a security hook and an LSM, which imo is
   probably the most common way to deal with this type of request
   (blocking something).

I admit those two ways will be less convenient than just having the sysctl
do everything, from user space's perspective.

Thanks
-Jeff
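
To make the options debated in this thread easier to compare, here is a
small Python model of how each policy would treat the flags passed to
memfd_create(2). It is only a sketch of the semantics as described in the
thread, not kernel code. The Policy names and the effective_flags() helper
are made up for illustration. The two flag values are assumed to match
linux/memfd.h, and the =0 behaviour (an old-style call is treated as
MFD_EXEC) is assumed from the sysctl documentation rather than stated in
the thread.

```python
from enum import Enum

# Flag values assumed to match include/uapi/linux/memfd.h.
MFD_NOEXEC_SEAL = 0x0008
MFD_EXEC = 0x0010


class Policy(Enum):
    """Hypothetical names for the behaviours discussed in the thread."""
    EXEC_DEFAULT = "=0"                     # old-style calls become executable
    NOEXEC_DEFAULT = "=1"                   # old-style calls are overwritten to noexec
    STRICT_BLOCK = "=2 (current)"           # anything without MFD_NOEXEC_SEAL is rejected
    OVERWRITE_AND_BLOCK = "overwrite+block" # the alternative argued for in the thread


def effective_flags(policy: Policy, flags: int) -> int:
    """Return the flags the memfd would end up with, or raise if rejected."""
    wants_exec = bool(flags & MFD_EXEC)
    wants_noexec = bool(flags & MFD_NOEXEC_SEAL)
    if wants_exec and wants_noexec:
        # Asking for both at once is contradictory in any policy.
        raise ValueError("MFD_EXEC and MFD_NOEXEC_SEAL are mutually exclusive")
    old_style = not (wants_exec or wants_noexec)

    if policy is Policy.EXEC_DEFAULT:
        return flags | MFD_EXEC if old_style else flags
    if policy is Policy.NOEXEC_DEFAULT:
        # "Overwriting": unmigrated callers silently get a non-executable memfd.
        return flags | MFD_NOEXEC_SEAL if old_style else flags
    if policy is Policy.STRICT_BLOCK:
        # Old-style callers fail here, which is the migration problem raised above.
        if not wants_noexec:
            raise PermissionError("memfd_create without MFD_NOEXEC_SEAL rejected")
        return flags
    if policy is Policy.OVERWRITE_AND_BLOCK:
        # Only an explicit request for an executable memfd is refused.
        if wants_exec:
            raise PermissionError("executable memfd rejected")
        return flags | MFD_NOEXEC_SEAL
    raise ValueError(f"unknown policy: {policy!r}")
```

In this model an old-style call (flags == 0) only fails under STRICT_BLOCK.
Under OVERWRITE_AND_BLOCK it succeeds and is non-executable, and an
explicit MFD_EXEC is refused in both. That difference is what the
disagreement above is about.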