Re: [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

Jeff Xu <jeffxu@xxxxxxxxxx> · Wed, 16 Aug 2023 15:46:41 -0700

On Tue, Aug 15, 2023 at 10:44 PM Dominique Martinet
<asmadeus@xxxxxxxxxxxxx> wrote:
>
> Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:
> > > Given that it is possible for CAP_SYS_ADMIN users to create executable
> > > binaries without memfd_create(2) and without touching the host
> > > filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> > > would be able to do that would be equivalent or worse), it seems strange
> > > to cause a fair amount of headache to admins when there doesn't appear
> > > to be an actual security benefit to blocking this. There appear to be
> > > concerns about confused-deputy-esque attacks[2] but a confused deputy that
> > > can write to arbitrary sysctls is a bigger security issue than
> > > executable memfds.
> > >
> > Something to point out: The demo code might be enough to prove your
> > case in other distributions, however, in ChromeOS, you can't run this
> > code. The executable in ChromeOS are all from known sources and
> > verified at boot.
> > If an attacker could run this code in ChromeOS, that means the
> > attacker already acquired arbitrary code execution through other ways,
> > at that point, the attacker no longer needs to create/find an
> > executable memfd, they already have the vehicle. You can't use an
> > example of an attacker already running arbitrary code to prove that
> > disable downgrading is useless.
> > I agree it is a big problem that an attacker already can modify a
> > sysctl.  Assuming this can happen by controlling arguments passed into
> > sysctl, at the time, the attacker might not have full arbitrary code
> > execution yet, that is the reason the original design is so
> > restrictive.
>
> I don't understand how you can say an attacker cannot run arbitrary code
> within a process here, yet assert that they'd somehow run memfd_create +
> execveat on it if this sysctl is lowered -- the two look equivalent to
> me?
>
It might require multiple steps for this attack, one possible scenario:
1> control a write primitive in CAP_SYSADMIN process's memory,  change
arguments of sysctl call, and downgrade the setting for memfd, e.g. change
it=0 to revert to old behavior (by default creating executable memfd)
2> control a non-privileged process that creates and writes to
memfd, and write the contents with the binary that the
attacker wants. This process just needs non-executable memfd, but
isn't updated yet.
3> Confuse a non-privilege process to execute the memfd the attacker
wrote in step 2.

In chromeOS, because all the executables are from verified sources,
attackers typically can't easily use the step 3 alone (without step
2),  and memfd was such a hole that enables an unverified executable.

In the original design, downgrading is not allowed, the attack chain
of 2/3 is completely blocked.  With this new approach, attackers will
try to find an additional step (step 1) to make the old attack (step 2
and 3) working again. It is difficult but I can't say it is
impossible.

> CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
> soon as you can run arbitrary code (just have a look at the various
> container escape example when the capability is given); I see little
> point in trying to harden just this here.

I'm not an expert in containers, if the industry is giving up on
privileged containers, then the reasoning makes sense.
>From ChromeOS point of view, we don't use runc currently, so I think
it makes more sense for runc users to drive these features.  The
original design is with runc's in mind, and even privileged containers
can't downgrade its own setting.

> It'd make more sense to limit all sysctl modifications in the context
> you're thinking of through e.g. selinux or another LSM.
>
I agree,  when I think more about this.
Security features fit LSM better, LSM can do additional "allow/deny"
on otherwise allowed behavior from user space code. Based on that,
"disallow downgrading" fits LSM better.  Also from the same reasoning,
I have second thoughts on the "=2", originally the "MEMFD_EXE was left
out due to the thinking, if user code explicitly setting MEMFD_EXE,
sysctl should not block it, it is the work of LSM. However, the "=2"
has evolved to block MEMFD_EXE completely ... alas .. it might be too
late to revert this, if this is what devs want, it can be that way.

Thanks
Best regards,
-Jeff

-Jeff

> (in the context of users making their own containers, my suggestion is
> always to never use CAP_SYS_ADMIN, or if they must give it to a separate
> minimal container where they can limit user interaction)
>
>
> FWIW, I also think the proposed =2 behaviour makes more sense, but this
> is something we already discussed last month so I won't come back to it
> as not really involved here.
>
> --
> Dominique Martinet | Asmadeus