On Mon, Oct 7, 2024 at 4:01 PM Stephen Smalley <stephen.smalley.work@xxxxxxxxx> wrote:
> Re-based again on latest selinux/dev, and pushed a few more changes:
> - selinux: make open_perms namespace-aware; demonstrates how to
> integrate the namespace-based checking with the open_perms policy
> capability so that we only check file open permission in namespaces
> that enable the capability in their policy. That was the easy case;
> the rest of the policy capabilities are less clear on how to resolve
> as per my earlier description.
> - selinux: split cred_ssid_has_perm() into two cases; alters how the
> namespace-based checking is applied to socket and SysV IPC checks
> based on some testing
> - selinux: set initial SID context for init to "kernel" in global SID
> table; fixes what would be a userspace compatibility problem for init
> in child namespaces by duplicating what we were already doing in the
> security server's per-policy SID table.
>
> In addition to running the testsuite in a child SELinux namespace as
> before, I also launched a RHEL9 UBI container, manually unshared the
> SELinux namespace in a shell within it, and loaded the RHEL9 policy
> into the child namespace to simulate what we would ultimately want the
> container runtime to do. That motivated the latter two changes above.
> Not quite ready to put that into enforcing mode due to some denials
> between container processes and host resources (like the inherited
> open fds) but a step forward.
> I think to better understand what else is needed, it would help to
> prototype modifications to a container runtime to unshare the SELinux
> namespace (just using the current /sys/fs/selinux/unshare interface
> for now) before launching the container init, and see if the container
> init does the right thing (i.e. concludes that SELinux doesn't yet
> have a policy loaded based on /proc/self/attr/current=="kernel",
> mounts its own /sys/fs/selinux private to its namespace, and loads its
> policy into it).
> Thoughts on what the easiest container runtime to
> patch in this way would be, or anyone want to try?

Re-based again on top of latest selinux/dev to resolve the conflicts
with the just-merged patches and to update the new netlink xperm
support for SELinux namespaces. Passes the selinux-testsuite including
the (not yet merged) nlmsg tests in both the init SELinux namespace and
a child SELinux namespace (modulo the labeled IPSEC tests, and with
either the init SELinux namespace permissive while testing the child or
the init namespace policy modified to permit running all the tests in
the child context).

Functionally, this is nearly complete as far as SELinux-only changes go
(not including the corresponding work needed to namespace audit and, if
desired/necessary, to allow namespacing of the labeled IPSEC hooks),
modulo any bugs that get discovered in trying to create real containers
with their own SELinux namespaces and different combinations of
policies between the host OS and the containers.

My remaining ToDo list is as follows, but this is a good point for
others to provide feedback, experiment with the functionality, or
propose their favorite container runtime for the next stages of
prototyping. If it would help spark feedback, I could post the current
set of kernel patches to the list.

- Test creation/use of SELinux namespaces from actual containers with
  different policies from the host OS. This may require patching a
  container runtime to add support for unsharing the SELinux namespace
  and unmounting the old selinuxfs prior to starting the container
  init. Combinations to test: no policy loaded on host, policy loaded
  in container, e.g. Fedora on Ubuntu; host with newer base policy than
  container, e.g. RHEL/Rocky 8/9 on Fedora; container with newer base
  policy than host, e.g. Fedora on RHEL/Rocky 8/9; host and container
  with different upstream policy sources, e.g. Ubuntu on Fedora;
  Android container on Linux host OS.
- Rework how policy capabilities are being checked/used to correctly
  support child namespaces with different policy capabilities from the
  parent. This has already been done for the open_perms capability by
  lifting the logic to walk the namespaces up into the hook function
  itself and checking the policy capability value in each namespace,
  but many (most?) of the policy capabilities don't lend themselves to
  this approach. For example, extended_socket_class enables
  finer-grained socket security classes, but this is checked and
  applied when the socket security blob is initialized, not at
  permission check time. Unless we want to move the mapping logic to
  every permission check, I am not sure what can be done there. One
  option would be to force-enable the same policy capabilities in the
  child namespace as in the parent to avoid conflicts, but this would
  limit the ability to use differing policies. Similarly, a number of
  policy capabilities control labeling behaviors rather than permission
  checks, and since we are no longer trying to support per-namespace
  object SIDs/contexts, only one namespace's policy can be applied to
  compute a label, and that label will then be used for all subsequent
  checks even in the other namespaces.
- Decide if any further hardening of selinuxfs is required to safely
  permit usage by potentially untrusted / less trusted processes in
  child namespaces. There has already been a lot of work to harden e.g.
  the policy loading logic against ill-formed policies and such, so not
  sure if there is much to do here, but noting it as a possible area to
  audit for safety.
- Ensure that we are correctly handling peer and packet labels when
  they cross SELinux namespaces, for some definition thereof, both wrt
  permission checking and wrt the peer/packet labels that are exposed
  to userspace via getsockopt(SO_PEERSEC), recvmsg() SCM_SECURITY, etc.
- Optimize the implementation for the single SELinux namespace case,
  reducing and/or eliminating the overhead introduced by the SELinux
  namespace support for that common case. Lots of work to do here, help
  welcome. Also would appreciate guidance on current Linux kernel
  benchmarking best practices since it has been a while since I've had
  to do that.
- Re-test with KASAN and with KCSAN enabled to confirm that the
  namespace patches haven't introduced any memory errors or race
  conditions; I have tested with each of these in the past successfully
  but don't keep them enabled generally because they make everything
  very slow. And you can't have them both enabled together at runtime
  AFAICT.
- Revisit the userspace API for unsharing the SELinux namespace if/when
  the rest is ready. Currently just "echo 1 > /sys/fs/selinux/unshare"
  (followed by the other necessary steps for unsharing the mount
  namespace, unmounting the parent's selinuxfs, mounting a new
  selinuxfs for the child, loading a policy, and setting enforcing
  mode). Options would include adding a CLONE_SECURITY flag to
  unshare/clone that could be implemented by any/all LSMs via a call to
  a new (stacked) LSM hook function, or one or more new LSM system
  calls to do the same, or just keeping it the way it is via selinuxfs.
- Upstream the kernel support.
- Figure out how to combine the use of SELinux namespaces with Red
  Hat's current model of isolating and confining containers as a whole
  via SELinux on the host OS. This is complicated by the fact that we
  are only supporting a single inode SID/context per inode (gave up on
  per-namespace inode SID/contexts, see the earlier mailing list
  discussions), and Red Hat's current model uses context mounts to
  assign a single security context to all the inodes used by the
  container. Possibly introduce a new kind of context= mount that is
  namespace-aware, i.e. only apply the context mount when in the outer
  namespace but use the inode xattrs inside the child namespace.
- Integrate and upstream userspace support into appropriate container
  runtimes, e.g. systemd-nspawn, crun/runc, podman, docker, k8s, etc.
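
For anyone looking at patching a container runtime, the current
selinuxfs-based unshare sequence described in the ToDo list can be
written down as an ordered plan. This is only a sketch: it executes
nothing, the Step type and plan_selinux_unshare() are hypothetical
names used for illustration, the policy path is a placeholder, and the
exact order of the steps after the unshare write follows the wording
above rather than a tested runtime.

```python
from dataclasses import dataclass

SELINUXFS = "/sys/fs/selinux"


@dataclass(frozen=True)
class Step:
    """One step of the plan; purely descriptive, never executed here."""
    description: str
    action: str


def plan_selinux_unshare(policy_path: str, enforcing: bool = True) -> list:
    """Return the ordered steps for entering a new SELinux namespace.

    policy_path is a placeholder for the binary policy the child
    namespace should load (an assumption, not a fixed location).
    """
    steps = [
        Step("unshare the SELinux namespace",
             f"write '1' to {SELINUXFS}/unshare"),
        Step("unshare the mount namespace",
             "unshare(CLONE_NEWNS)"),
        Step("unmount the parent's selinuxfs",
             f"umount {SELINUXFS}"),
        Step("mount a new selinuxfs for the child",
             f"mount -t selinuxfs none {SELINUXFS}"),
        Step("load a policy into the child namespace",
             f"write contents of {policy_path} to {SELINUXFS}/load"),
    ]
    if enforcing:
        steps.append(Step("set enforcing mode",
                          f"write '1' to {SELINUXFS}/enforce"))
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan_selinux_unshare("/path/to/policy"), 1):
        print(f"{number}. {step.description}: {step.action}")
```

Keeping the plan as data makes it easy to compare against whatever a
given runtime actually does before the container init is started, and
to drop the enforcing step while denials are still being worked out.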
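
To make the open_perms handling in the policy capability item more
concrete, here is a toy userspace model of the idea: walk from the
current namespace up through its ancestors, and in each one add the
"open" permission to the check only if that namespace's policy enables
the capability. This is not the kernel code. The Namespace class, the
string SIDs, and the flat set of allowed (source, target, permission)
triples are all simplifying assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Namespace:
    """Toy SELinux namespace: a policy plus a link to its parent."""
    name: str
    parent: Optional["Namespace"] = None
    open_perms: bool = False  # policy capability in this namespace's policy
    allowed: set = field(default_factory=set)  # (ssid, tsid, perm) triples


def ancestors(ns):
    """Yield ns, then each parent in turn, ending at the init namespace."""
    while ns is not None:
        yield ns
        ns = ns.parent


def file_open_allowed(ns, ssid, tsid, requested):
    """Check a file open in ns and in every ancestor namespace.

    The requested permissions are checked in every namespace, but "open"
    is only added in namespaces whose policy enables open_perms.
    """
    for current in ancestors(ns):
        perms = set(requested)
        if current.open_perms:
            perms.add("open")
        for perm in perms:
            if (ssid, tsid, perm) not in current.allowed:
                return False
    return True


if __name__ == "__main__":
    init_ns = Namespace("init", open_perms=True,
                        allowed={("proc_t", "file_t", "read"),
                                 ("proc_t", "file_t", "open")})
    child_ns = Namespace("child", parent=init_ns, open_perms=False,
                         allowed={("proc_t", "file_t", "read")})
    print(file_open_allowed(child_ns, "proc_t", "file_t", {"read"}))
```

The same model shows why capabilities such as extended_socket_class do
not fit this pattern: they affect state computed once at object
initialization, so there is no per-namespace decision left to make at
check time.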