Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec

Jeff Xu <jeffxu@xxxxxxxxxxxx> · Wed, 2 Aug 2023 13:45:21 -0700

> > > > >  * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > > > >    because it will make it far too difficult to ever migrate. Instead it
> > > > >    should imply MFD_EXEC.
> > > > >
> > > > Though the purpose of memfd_noexec=2 is not to help with migration  -
> > > > but to disable creation of executable memfd for the current system/pid
> > > > namespace.
> > > > During the migration, vm.memfd_noexec = 1 helps overwriting for
> > > > unmigrated user code as a temporary measure.
> > >
> > > My point is that the current behaviour for =2 means that nobody other
> > > than *maybe* ChromeOS will ever be able to use it because it requires
> > > auditing every program on the system. In fact, it's possible even
> > > ChromeOS will run into issues given that one of the arguments made for
> > > the nosymfollow mount option was that auditing all of ChromeOS to
> > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1]
> > > (which I agreed with). Maybe this is less of an issue with
> > > memfd_create(2) (which is much newer than open(2)) but it still seems
> > > like a lot of busy work when the =1 behaviour is entirely sane even in
> > > the strict threat model that =2 is trying to protect against.
> > >
> > > It can also be a container (that has all memfd_create calls migrated
> > > to the new API)
>
> If ChromeOS would struggle to rewrite all of the libraries they use,
> containers are in even worse shape -- most container users don't have a
> complete list of every package installed in a container, let alone the
> ability to audit whether they pass a (no-op) flag to memfd_create(2) in
> every codepath.
>
> > One option I considered previously was "=2" would do overwrite+block ,
> > and "=3" just block. But then I worry that applications won't have
> > motivation to ever change their existing code, the setting will
> > forever stay at "=2", making "=3" even more impossible to ever be used
> > > system wide.
>
> What is the downside of overwriting? Backwards-compatibility is a very
> important part of Linux -- being able to use old programs without having
> to modify them is incredibly important. Yes, this behaviour is opt-in --
> but I don't see the point of making opting in more difficult than
> necessary. Surely overwrite+block provides the security guarantee you
> need from the threat model -- otherwise nobody will be able to use block
> because you never know if one library will call memfd_create()
> "incorrectly" without the new flags.
>
>
> > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL,
> > > there are several tools for doing this (both seccomp and LSM hooks).
> > >
> > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@xxxxxxxxxx/
> > >
> > > > Additional functionality/features should be implemented through
> > > > security hook and LSM, not sysctl, I think.
> > >
> > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you
> > > could implement either =2 behaviour with an LSM using =1, and the
> > > current strict =2 behaviour could be implemented purely with seccomp.)
> > >
> > By migration, I mean  a system that is not fully migrated, such a
> > system should just use "=0" or "=1". Additional features can be
> > implemented in SELinux/Landlock/other LSM by a motivated dev.  e.g. if
> > a system wants to limit executable memfd to specific programs or fully
> > disable it.
> > "=2" is for a system/container that is fully migrated, in that case,
> > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient
> >  alternative.
> > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" and
> > seccomp (block MFD_EXEC), it will overwrite + block X mfd, which is
> > > essentially what you want, iiuc. However, I do not wish to have this
> > implemented in kernel, due to the thinking that I want kernel to get
> > out of business of "overwriting" eventually.
>
> See my above comments -- "overwriting" is perfectly acceptable to me.
> There's also no way to "get out of the business of overwriting" -- Linux
> has strict backwards compatibility requirements.
>

I agree: if we weigh the short-term goal of letting user space
applications do the minimum, then a 4-state sysctl (or 2 sysctls, one
controlling overwrite and one disabling/enabling executable memfd)
will do. But with that approach, I'm afraid that in one version of the
future (say in 20 years), most applications will still call
memfd_create in the old API style, without setting the NX bit. The
current approach might seem less convenient, but I hope it offers a
bit of incentive for applications to migrate their code to the new
API and set the NX bit explicitly. I understand this hope is
questionable, and we might still end up in the same place in 20 years,
but at least I tried :-). I will leave this decision to the
maintainers when you supply patches for that, and I wouldn't feel bad
either way; there are valid reasons on both sides.

To supplement, there are two other ways to get what you want:
1> Use seccomp to block MFD_EXEC, and leave the setting at 1.
2> Implement the blocking using a security hook and an LSM, which imo
is probably the most common way to deal with this type of request
(blocking something).
I admit those two ways will be less convenient than just having the
sysctl do everything, from the user space's perspective.
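
To make the states concrete, here is a rough model of the behaviours
discussed in this thread. It is only a sketch, not the kernel code: the
flag values are the ones from include/uapi/linux/memfd.h, but the mode
names and function names are made up for illustration, and None stands
in for "syscall rejected".

```python
# Flag values as defined in include/uapi/linux/memfd.h.
MFD_NOEXEC_SEAL = 0x0008
MFD_EXEC = 0x0010


def apply_policy(flags, mode):
    """Return the effective flags, or None if the call would be rejected.

    The mode names are hypothetical labels for the states discussed in
    this thread, not real sysctl values.
    """
    has_exec = bool(flags & MFD_EXEC)
    has_noexec = bool(flags & MFD_NOEXEC_SEAL)
    if has_exec and has_noexec:
        return None  # contradictory request, rejected in every mode
    explicit = has_exec or has_noexec

    if mode == "exec":  # old-style call implies MFD_EXEC
        return flags if explicit else flags | MFD_EXEC
    if mode == "noexec":  # old-style call implies MFD_NOEXEC_SEAL
        return flags if explicit else flags | MFD_NOEXEC_SEAL
    if mode == "overwrite_block":  # overwrite old-style, block MFD_EXEC
        return None if has_exec else flags | MFD_NOEXEC_SEAL
    if mode == "block":  # reject anything without explicit NOEXEC_SEAL
        return flags if has_noexec else None
    raise ValueError("unknown mode: %r" % (mode,))


def noexec_plus_seccomp(flags):
    """Model of option 1>: a seccomp filter that rejects MFD_EXEC,
    combined with the overwriting ("noexec") setting."""
    if flags & MFD_EXEC:
        return None  # the filter fails the syscall before the kernel sees it
    return apply_policy(flags, "noexec")
```

In this model, noexec_plus_seccomp() gives the same result as the
"overwrite_block" mode for every combination of the two flags, while
the "block" mode additionally rejects old-style calls that pass
neither flag.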

Thanks

-Jeff