On Tue, Aug 15, 2023 at 10:44 PM Dominique Martinet <asmadeus@xxxxxxxxxxxxx> wrote: > > Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700: > > > Given that it is possible for CAP_SYS_ADMIN users to create executable > > > binaries without memfd_create(2) and without touching the host > > > filesystem (not to mention the many other things a CAP_SYS_ADMIN process > > > would be able to do that would be equivalent or worse), it seems strange > > > to cause a fair amount of headache to admins when there doesn't appear > > > to be an actual security benefit to blocking this. There appear to be > > > concerns about confused-deputy-esque attacks[2] but a confused deputy that > > > can write to arbitrary sysctls is a bigger security issue than > > > executable memfds. > > > > > Something to point out: The demo code might be enough to prove your > > case in other distributions, however, in ChromeOS, you can't run this > > code. The executable in ChromeOS are all from known sources and > > verified at boot. > > If an attacker could run this code in ChromeOS, that means the > > attacker already acquired arbitrary code execution through other ways, > > at that point, the attacker no longer needs to create/find an > > executable memfd, they already have the vehicle. You can't use an > > example of an attacker already running arbitrary code to prove that > > disable downgrading is useless. > > I agree it is a big problem that an attacker already can modify a > > sysctl. Assuming this can happen by controlling arguments passed into > > sysctl, at the time, the attacker might not have full arbitrary code > > execution yet, that is the reason the original design is so > > restrictive. > > I don't understand how you can say an attacker cannot run arbitrary code > within a process here, yet assert that they'd somehow run memfd_create + > execveat on it if this sysctl is lowered -- the two look equivalent to > me? > It might require multiple steps for this attack, one possible scenario: 1> control a write primitive in CAP_SYSADMIN process's memory, change arguments of sysctl call, and downgrade the setting for memfd, e.g. change it=0 to revert to old behavior (by default creating executable memfd) 2> control a non-privileged process that creates and writes to memfd, and write the contents with the binary that the attacker wants. This process just needs non-executable memfd, but isn't updated yet. 3> Confuse a non-privilege process to execute the memfd the attacker wrote in step 2. In chromeOS, because all the executables are from verified sources, attackers typically can't easily use the step 3 alone (without step 2), and memfd was such a hole that enables an unverified executable. In the original design, downgrading is not allowed, the attack chain of 2/3 is completely blocked. With this new approach, attackers will try to find an additional step (step 1) to make the old attack (step 2 and 3) working again. It is difficult but I can't say it is impossible. > CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as > soon as you can run arbitrary code (just have a look at the various > container escape example when the capability is given); I see little > point in trying to harden just this here. I'm not an expert in containers, if the industry is giving up on privileged containers, then the reasoning makes sense. >From ChromeOS point of view, we don't use runc currently, so I think it makes more sense for runc users to drive these features. The original design is with runc's in mind, and even privileged containers can't downgrade its own setting. > It'd make more sense to limit all sysctl modifications in the context > you're thinking of through e.g. selinux or another LSM. > I agree, when I think more about this. Security features fit LSM better, LSM can do additional "allow/deny" on otherwise allowed behavior from user space code. Based on that, "disallow downgrading" fits LSM better. Also from the same reasoning, I have second thoughts on the "=2", originally the "MEMFD_EXE was left out due to the thinking, if user code explicitly setting MEMFD_EXE, sysctl should not block it, it is the work of LSM. However, the "=2" has evolved to block MEMFD_EXE completely ... alas .. it might be too late to revert this, if this is what devs want, it can be that way. Thanks Best regards, -Jeff -Jeff > (in the context of users making their own containers, my suggestion is > always to never use CAP_SYS_ADMIN, or if they must give it to a separate > minimal container where they can limit user interaction) > > > FWIW, I also think the proposed =2 behaviour makes more sense, but this > is something we already discussed last month so I won't come back to it > as not really involved here. > > -- > Dominique Martinet | Asmadeus