On Wed, Aug 17, 2022 at 04:24:28PM -0500, Eric W. Biederman wrote:
> Paul Moore <paul@xxxxxxxxxxxxxx> writes:
>
> > On Wed, Aug 17, 2022 at 4:56 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> >> Paul Moore <paul@xxxxxxxxxxxxxx> writes:
> >> > On Wed, Aug 17, 2022 at 3:58 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> >> >> Paul Moore <paul@xxxxxxxxxxxxxx> writes:
> >> >>
> >> >> > At the end of the v4 patchset I suggested merging this into lsm/next
> >> >> > so it could get a full -rc cycle in linux-next, assuming no issues
> >> >> > were uncovered during testing
> >> >>
> >> >> What in the world can be uncovered in linux-next for code that has no in
> >> >> tree users.
> >> >
> >> > The patchset provides both BPF LSM and SELinux implementations of the
> >> > hooks along with a BPF LSM test under tools/testing/selftests/bpf/.
> >> > If no one beats me to it, I plan to work on adding a test to the
> >> > selinux-testsuite as soon as I'm done dealing with other urgent
> >> > LSM/SELinux issues (io_uring CMD passthrough, SCTP problems, etc.); I
> >> > run these tests multiple times a week (multiple times a day sometimes)
> >> > against the -rcX kernels with the lsm/next, selinux/next, and
> >> > audit/next branches applied on top.  I know others do similar things.
> >>
> >> A layer of hooks that leaves all of the logic to userspace is not an
> >> in-tree user for purposes of understanding the logic of the code.
> >
> > The BPF LSM selftests which are part of this patchset live in-tree.
> > The SELinux hook implementation is completely in-tree with the
> > subject/verb/object relationship clearly described by the code itself.
> > After all, the selinux_userns_create() function consists of only two
> > lines, one of which is an assignment.  Yes, it is true that the
> > SELinux policy lives outside the kernel, but that is because there is
> > no singular SELinux policy for everyone.
> > From a practical perspective, the SELinux policy is really just a
> > configuration file used to setup the kernel at runtime; it is not
> > significantly different than an iptables script, /etc/sysctl.conf,
> > or any of the other myriad of configuration files used to configure
> > the kernel during boot.
>
> I object to adding the new system configuration knob.

I do strongly sympathize with Eric's points.  It will be very easy, once
user namespace creation has been further restricted in some distros, to
say "well see this stuff is silly" and go back to simply requiring root
to create all containers and namespaces, which is generally quite a bit
easier anyway.  And then, of course, give everyone root so they can
start containers.

As Eric said,

| Further adding a random failure mode to user namespace creation if it is
| used at all will just encourage userspace to use a setuid application to
| perform the namespace creation instead.  Creating a less secure system
| overall.

However, I'm also looking at e.g. CVE-2022-2588 and CVE-2022-2586, and
yes there are two issues which do require discussion (three if you count
reportability, which is mainly a tool in guarding against the others).

The first is, indeed, configuration knobs.  There are tools, including
Chrome, which use user namespaces to make things better.  The hope is
that more and more tools will do so.

The second is damage control.  When a 0day has been announced, things
change.  You can say "well the bug was there all along", but it is
different when every lazy ne'er-do-well can pick an exploit off a mailing
list and use it against a product for which spinning a new version with
a new kernel and getting customers to update is probably a months-long
endeavor.  Some of these products do in fact require namespaces (user
and otherwise) as part of their function.
And - to my chagrin - I suspect most of them create user namespaces as
the root user, before possibly processing untrusted user input, so
unprivileged_userns_clone isn't a good fit.

SELinux (and LSMs in general) do in fact seem like a useful place to add
some configuration, because they tend to assign different domains to
tasks with different purposes and trust levels.  But another such place
is the init system / service manager.  And in most cases these days,
this will use cgroups to collect tasks of certain types.  So I wonder
(this is ALMOST ENTIRELY thinking out loud, not thought through
sufficiently) whether we should be setting a cgroup.nslock or somesuch.

Of course, kernel livepatch is another potentially useful mitigation.
Currently that's not possible for everyone.

Maybe there is a more fundamental way we can approach this.  Part of me
still likes the idea of splitting the id mapping and capability-in-userns
parts, but that's not sufficient.  Maybe looking over all the relevant
CVEs would give a better hint.

Eric, you said

| If the concern is to reduce the attack surface everything this
| proposed hook can do is already possible with the security_capable
| security hook.

I suppose I could envision an LSM which gets activated when we find out
there was a net-ns-exacerbated 0-day, which refuses CAP_NET_ADMIN for a
task not in init_user_ns?  Ideally it would be more flexible than that.

> idea.  What is userspace going to do with this new feature that makes it
> worth maintaining in the kernel?
>
> That is always the conversation we have when adding new features, and
> that is exactly the conversation that has not happened here.

Eric and Paul, I wonder, will you - or some people you'd like to
represent you - be at Plumbers in September?  Should there be a BOF
session there?  (I won't be there, but could join over video.)  I think
a brainstorming session for solutions to the above problems would be
good.

> Adding a layer of indirection should not exempt a new feature from
> needing to justify itself.
>
> Eric