On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > And nobody really complained when we weakened it, so maybe removing it
> > > entirely might be acceptable.
> >
> > I guess we could just try it and see... Worst comes to worst, we'll
> > have to put it back, but at least we'd know what crazy thing still
> > wants it..
>
> Umm... I'll need to go back and look through the thread, but I'm
> fairly sure that there used to be suckers that did replacement of
> binary that way (try to write, count on exclusion with execve while
> it's being written to) instead of using rename. Install scripts
> of weird crap and stuff like that...

I'm not against trying to remove it, but I think Al has a point.
Removing the write protection will also most certainly make certain
classes of attacks _easier_. For example, the runC container breakout
using privileged containers (CVE-2019-5736) would be easier. I'm quoting
from the commit I fixed this with:

    The attack can be made when attaching to a running container or
    when starting a container running a specially crafted image. For
    example, when runC attaches to a container the attacker can trick
    it into executing itself. This could be done by replacing the
    target binary inside the container with a custom binary pointing
    back at the runC binary itself. As an example, if the target binary
    was /bin/bash, this could be replaced with an executable script
    specifying the interpreter path #!/proc/self/exe (/proc/self/exe is
    a symbolic link created by the kernel for every process which
    points to the binary that was executed for that process). As such
    when /bin/bash is executed inside the container, instead the target
    of /proc/self/exe will be executed - which will point to the runc
    binary on the host.
    The attacker can then proceed to write to the target of
    /proc/self/exe to try and overwrite the runC binary on the host.

and then the write protection kicks in of course:

    However in general, this will not succeed as the kernel will not
    permit it to be overwritten whilst runC is executing.

which the attacker can of course already overcome nowadays with minimal
smarts:

    To overcome this, the attacker can instead open a file descriptor
    to /proc/self/exe using the O_PATH flag and then proceed to reopen
    the binary as O_WRONLY through /proc/self/fd/<nr> and try to write
    to it in a busy loop from a separate process. Ultimately it will
    succeed when the runC binary exits. After this the runC binary is
    compromised and can be used to attack other containers or the host
    itself.

But with write protection removed you'd allow such attacks to succeed
right away. It's not a huge deal to remove it since we need to have
other protection mechanisms in place already:

    To prevent this attack, LXC has been patched to create a temporary
    copy of the calling binary itself when it starts or attaches to
    containers. To do this LXC creates an anonymous, in-memory file
    using the memfd_create() system call and copies itself into the
    temporary in-memory file, which is then sealed to prevent further
    modifications. LXC then executes this sealed, in-memory file
    instead of the original on-disk binary. Any compromising write
    operations from a privileged container to the host LXC binary will
    then write to the temporary in-memory binary and not to the host
    binary on-disk, preserving the integrity of the host LXC binary.
    Also as the temporary, in-memory LXC binary is sealed, writes to
    this will also fail.

    Note: memfd_create() was added to the Linux kernel in the 3.17
    release.

However, I'd still like to pitch the upgrade mask idea that Aleksa and I
tried to implement when we did openat2().
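
To make the memfd-based mitigation from the quoted commit concrete, here
is a minimal sketch of the "copy into a sealed in-memory file" step. It
is not LXC's actual code: sealed_copy() and the "sealed-copy" memfd name
are made up for illustration, and it assumes glibc 2.27 or later for the
memfd_create() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy the file at `path` into an anonymous memfd and seal it so that
 * its contents and size can no longer change. Returns the sealed fd,
 * or -1 with errno set. Hypothetical helper, for illustration only.
 */
int sealed_copy(const char *path)
{
	char buf[65536];
	ssize_t n;
	int src, memfd, saved;

	assert(path != NULL);

	src = open(path, O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -1;

	/* Without MFD_ALLOW_SEALING the later F_ADD_SEALS would fail. */
	memfd = memfd_create("sealed-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0)
		goto err_src;

	/* Plain read/write copy loop that handles short writes and EINTR. */
	while ((n = read(src, buf, sizeof(buf))) != 0) {
		char *p = buf;

		if (n < 0) {
			if (errno == EINTR)
				continue;
			goto err_memfd;
		}
		while (n > 0) {
			ssize_t w = write(memfd, p, (size_t)n);

			if (w < 0) {
				if (errno == EINTR)
					continue;
				goto err_memfd;
			}
			p += w;
			n -= w;
		}
	}

	/* Forbid writes, resizing, and any later change to the seal set. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SEAL | F_SEAL_SHRINK |
				      F_SEAL_GROW | F_SEAL_WRITE) < 0)
		goto err_memfd;

	close(src);
	return memfd;

err_memfd:
	saved = errno;
	close(memfd);
	errno = saved;
err_src:
	saved = errno;
	close(src);
	errno = saved;
	return -1;
}
```

A caller would typically pass "/proc/self/exe" and then hand the
returned fd to fexecve(), so that the process which enters the container
runs from the sealed copy rather than from the on-disk binary.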
If we leave write protection in place, preventing /proc/self/exe from
being written to, we can take some time and upstream the upgrade mask
patchset, which was part of the initial openat2() patchset but was
dropped back then (and I had Linus remove the last remnants of the idea
in [1]).

The idea was to add a new field "upgrade_mask" to struct open_how that
would allow a caller to specify with what permissions an fd could be
reopened. I still like this idea a great deal and it would be a very
welcome addition to system management programs. The upgrade mask is of
course optional, i.e. the caller would have to specify the upgrade mask
at open time to restrict reopening (lest we regress the whole world).

But we could make it so that an O_PATH fd gotten from opening
/proc/<pid>/exe always gets a restricted upgrade mask set and so it
can't be upgraded to an O_WRONLY fd afterwards. For this to be
meaningful, write protection for /proc/self/exe would need to be kept.

[1]: commit 5c350aa11b441b32baf3bfe4018168cb8d10cef7
     Author: Christian Brauner <christian.brauner@xxxxxxxxxx>
     Date:   Fri May 28 11:24:15 2021 +0200

         fcntl: remove unused VALID_UPGRADE_FLAGS

         We currently do not make use of this feature and should we
         implement something like this in the future it's trivial to
         add it back.

         Link: https://lore.kernel.org/r/20210528092417.3942079-2-brauner@xxxxxxxxxx
         Cc: Christoph Hellwig <hch@xxxxxx>
         Cc: Aleksa Sarai <cyphar@xxxxxxxxxx>
         Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
         Cc: linux-fsdevel@xxxxxxxxxxxxxxx
         Suggested-by: Richard Guy Briggs <rgb@xxxxxxxxxx>
         Reviewed-by: Richard Guy Briggs <rgb@xxxxxxxxxx>
         Reviewed-by: Christoph Hellwig <hch@xxxxxx>
         Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx>
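
For readers unfamiliar with the reopen path that an upgrade mask would
restrict, here is a minimal sketch of how an fd is "upgraded" today. The
upgrade mask itself is not shown since it was never merged; reopen_fd()
is a made-up helper name, and the sketch assumes /proc is mounted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Reopen an existing fd (typically an O_PATH fd) with new open flags
 * by going through its /proc/self/fd/<nr> magic link. Returns the new
 * fd, or -1 with errno set. Hypothetical helper, for illustration only.
 */
int reopen_fd(int fd, int flags)
{
	char path[64];
	int len;

	assert(fd >= 0);

	len = snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -1;

	/*
	 * The kernel resolves the magic link to the file behind `fd` and
	 * applies the usual permission checks for the requested flags.
	 * Nothing recorded at the original open() restricts them.
	 */
	return open(path, flags | O_CLOEXEC);
}
```

On a regular file the caller may write to, reopen_fd(pfd, O_WRONLY) on
an O_PATH fd yields a writable fd. On an O_PATH fd for /proc/self/exe
the same call is expected to fail with ETXTBSY for as long as the binary
is being executed, which is the write protection discussed in this
thread.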