On Fri, Sep 27, 2024 at 10:48 AM Stephen Smalley <stephen.smalley.work@xxxxxxxxx> wrote:
> Since an increasing number of the testsuite tests are failing in a
> child SELinux namespace due to even unconfined_t in the parent
> namespace not being allowed the requisite permissions in the parent
> namespace, I've created a modified version of the testsuite policy to
> allow those permissions to unconfined_t and also disabled the tests
> that cannot work currently due to the separate network namespace.
> Those changes are on a branch of my fork of the selinux-testsuite at:
> https://github.com/stephensmalley/selinux-testsuite/tree/selinuxns
>
> With those changes, if I load the test policy into the parent
> namespace (so that the test domains/types are defined and access is
> allowed to unconfined_t) and then create a child namespace from an
> unconfined_t shell and run the testsuite from it, all of the
> (still-enabled) tests pass. I'll keep amending the test policy on that
> branch with further changes as I convert additional permission checks
> to be namespace-aware. Eventually we can figure out if it makes sense
> to merge these into the main testsuite but that can wait until we're
> ready to merge the kernel namespace support itself.

Based on our discussion at the last project meeting, I removed the requirement to unshare the network namespace when unsharing the SELinux namespace by adding a check in selnl_notify() to only send the SELinux netlink notifications to the init network namespace if the triggering process is in the init SELinux namespace. Hence, the creator of a child SELinux namespace can either choose to unshare the network namespace if they want to receive such netlink notifications (in which case they will be sent to that child network namespace only), or they can just use the SELinux status page exported by /sys/fs/selinux/status, which is the default in libselinux for kernels that support it.

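For readers unfamiliar with the status page, the following is a minimal sketch of decoding it from a shell. It assumes the field order of struct selinux_kernel_status in the mainline kernel (version, sequence, enforcing, policyload, deny_unknown, each a native 32-bit value) and assumes the namespace branch keeps that layout. The function name read_selinux_status is hypothetical and is not part of libselinux.

```shell
#!/bin/sh
# Sketch only: decode the first five 32-bit fields of the SELinux status page.
# Assumption: layout matches mainline struct selinux_kernel_status
# (version, sequence, enforcing, policyload, deny_unknown).
# $1 = path to the status file (defaults to /sys/fs/selinux/status).
read_selinux_status() {
    status_file=${1:-/sys/fs/selinux/status}
    # od prints host-endian unsigned 32-bit words, matching the kernel's native u32.
    # Word splitting of the od output is intentional here.
    set -- $(od -An -tu4 -N20 "$status_file")
    # Fewer than five words means the file was missing or too short.
    [ "$#" -eq 5 ] || return 1
    # An odd sequence number means an update is in progress; the caller should retry.
    if [ $(( $2 % 2 )) -ne 0 ]; then
        return 2
    fi
    echo "version=$1 sequence=$2 enforcing=$3 policyload=$4 deny_unknown=$5"
}
```

Calling read_selinux_status with no argument inside a child namespace (after mounting its own selinuxfs) should report that namespace's enforcing state and policy load count without any netlink socket. A real consumer would mmap the page and loop on the sequence number the way libselinux does; a single read as shown here is only suitable for spot checks.
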
With that change, I can now run all of the selinux-testsuite tests successfully from a child SELinux namespace except for two labeled IPSEC tests each for inet_socket/tcp, inet_socket/udp, and inet_socket/mptcp. To fully pass the other tests, I had to also put the parent namespace into permissive mode to avoid certain failures due to MCS constraints in the base policy that can't be overridden via the test policy. The remaining labeled IPSEC test failures are likely due to the fact that the xfrm hooks are not passed a sock structure or anything else from which I can obtain the appropriate SELinux namespace to use so they are hardcoded to use the init SELinux namespace and even when it is permissive, there are hardcoded SID comparisons in those hooks that are likely failing.

I also introduced configurable limits for the maximum number of SELinux namespaces and for the maximum depth to which they can be nested. The default values of each can be controlled via Kconfig options, which default to 65535 and 32 respectively (matching user namespaces), and can be further adjusted via /sys/fs/selinux/maxns and /sys/fs/selinux/maxnsdepth respectively but only from the init SELinux namespace (child namespaces can read but not modify them). A simple pair of test scripts to recursively create SELinux namespaces correctly failed when it hit the maxnsdepth and lowering the maxns value correctly prevented exceeding that number of total namespaces. These tests however exposed a couple of reference counting bugs in the code (one for SELinux namespaces, one for the parent cred that we cache in the task security blob for use in checks on the parent namespace), which are now also fixed.

I have completed converting all of the permission checks to use the namespace-aware helpers or annotated them with comments indicating when it is correct to only check against the current SELinux namespace.
For some of the checks, it is debatable as to which helper should be used, so we may need to revisit some of these based on experience.

What remains to be done:

1. Maybe rework how policy capabilities are being checked/used to correctly support child namespaces with different policy capabilities from the parent. I can do this for some simple cases by lifting the logic to walk the namespaces up into the hook function itself and checking the policy capability value in each namespace, but many (most?) of the policy capabilities don't lend themselves to this approach. For example, extended_socket_class enables finer-grained socket security classes, but this is checked and applied when the socket security blob is initialized, not at permission check time. Unless we want to move the mapping logic to every permission check, I am not sure what can be done there. Similarly, a number of policy capabilities control labeling behaviors rather than permission checks, and since we are no longer trying to support per-namespace object SIDs/contexts, only one namespace's policy can be applied when labeling an object; that label will then be used for all subsequent checks even in the other namespaces.

2. Decide if any further hardening of selinuxfs is required to safely permit usage by potentially untrusted / less trusted processes in child namespaces. There has already been a lot of work to harden e.g. the policy loading logic against ill-formed policies and such, so not sure if there is much to do here, but noting it. I would like to get rid of /sys/fs/selinux/user altogether so possibly making it inaccessible in child namespaces would be a good first step.

3. Optimize the implementation for the single SELinux namespace case, reducing and/or eliminating the overhead introduced by the SELinux namespace support for that common case. Lots of work to do here, help welcome. Also would appreciate guidance on current benchmarking practices since it has been a while since I've had to do that.

4. Revisit the userspace API for unsharing the SELinux namespace if/when the rest is ready. Currently just "echo 1 > /sys/fs/selinux/unshare" (followed by the other necessary steps for unsharing the mount namespace, unmounting the parent's selinuxfs, mounting a new selinuxfs for the child, loading a policy, and setting enforcing mode). Options would include adding a CLONE_SECURITY flag to unshare/clone that could be implemented by any/all LSMs via a call to a new (stacked) LSM hook function, or one or more new LSM system calls to do the same, or just keeping it the way it is via selinuxfs.

Experimentation is welcome, particularly for more complex cases, e.g. where the host policy and the child policy differ (no policy loaded on host, policy in child; policy loaded on host, no policy in child; host policy from one distribution/release, child from another; etc). Be aware however that since the permission checks are applied to the current namespace and its ancestors, the parent namespace may deny something that would be allowed in the child, especially if the child is using contexts that are unknown to the parent's policy (which will be treated as unlabeled for those checks in the parent). Also be aware that since we are no longer trying to support per-namespace object SIDs/contexts, any object first instantiated in the parent namespace will be labeled according to its policy, not the child's policy.

The tree can be found at:
https://github.com/stephensmalley/selinux-kernel/tree/working-selinuxns

It may be re-based or changed at any time.

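As a rough illustration of the caveat above about checks being applied to the current namespace and its ancestors, here is a toy model in shell. It is not kernel code and does not reflect the actual helpers in the tree. Every name in it (the "parent" and "child" namespaces, app_t, app_data_t, and all four functions) is hypothetical, and it ignores permissive mode, MLS/MCS, and object classes.

```shell
#!/bin/sh
# Toy model, not kernel code: every name below is hypothetical.

# Map a context to itself if the namespace's policy defines it,
# otherwise to unlabeled_t. $1 = namespace, $2 = context.
resolve_context() {
    case "$1:$2" in
        parent:unconfined_t|parent:etc_t) echo "$2" ;;
        child:unconfined_t|child:etc_t|child:app_t|child:app_data_t) echo "$2" ;;
        *) echo unlabeled_t ;;
    esac
}

# Allow rules per namespace, keyed as "namespace:source:target:permission".
policy_allows() {
    case "$1:$2:$3:$4" in
        parent:unconfined_t:etc_t:read) return 0 ;;
        child:unconfined_t:etc_t:read)  return 0 ;;
        child:app_t:app_data_t:read)    return 0 ;;
        *) return 1 ;;
    esac
}

# Parent of each namespace; the topmost namespace ("parent" here) has none.
parent_of() {
    case "$1" in
        child) echo parent ;;
        *) echo "" ;;
    esac
}

# Walk from the current namespace up through its ancestors.
# Access is granted only if every namespace on the path allows it.
check_access() {
    ns=$1 src=$2 tgt=$3 perm=$4
    while [ -n "$ns" ]; do
        s=$(resolve_context "$ns" "$src")
        t=$(resolve_context "$ns" "$tgt")
        if ! policy_allows "$ns" "$s" "$t" "$perm"; then
            echo "denied in $ns ($s -> $t)"
            return 1
        fi
        ns=$(parent_of "$ns")
    done
    echo allowed
}
```

In this model, "check_access child unconfined_t etc_t read" prints "allowed" because both policies permit it, while "check_access child app_t app_data_t read" prints "denied in parent (unlabeled_t -> unlabeled_t)": the child's policy allows the access, but the parent's policy does not know either context and has no rule for unlabeled.
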
To experiment, after building and booting this kernel, do the following:

# Create root shell
sudo bash

# Unshare SELinux namespace
echo 1 > /sys/fs/selinux/unshare
id
# Context is now "init" or "kernel" in child; ps -eZ from parent will still show original context

# Unshare mount namespace and mount new selinuxfs for child SELinux namespace
unshare -m
umount /sys/fs/selinux
mount -t selinuxfs none /sys/fs/selinux

# Load a policy into the child SELinux namespace, parent unaffected
load_policy
id
# Context is now kernel_generic_helper_t on Fedora due to a default transition in its policy

# Switch to a suitable security context before trying to go enforcing
runcon unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 /bin/bash

# Switch child to enforcing, checking that you didn't get killed once enforcing
echo $$
setenforce 1
echo $$

# Do stuff in child, run testsuite (switch parent to permissive first to avoid denials from it), etc.
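
While experimenting in a child namespace, the namespace limits mentioned earlier can be inspected read-only. The helper below is a small sketch for doing so. The name show_selinuxns_limits is hypothetical, and it assumes that maxns and maxnsdepth each contain a single plain value, which is not confirmed above.

```shell
#!/bin/sh
# Sketch only: print the SELinux namespace limits exposed under selinuxfs.
# Assumption: each file holds one plain value (format not confirmed).
# $1 = selinuxfs mount point (defaults to /sys/fs/selinux).
show_selinuxns_limits() {
    selinuxfs=${1:-/sys/fs/selinux}
    for knob in maxns maxnsdepth; do
        if [ -r "$selinuxfs/$knob" ]; then
            printf '%s=%s\n' "$knob" "$(cat "$selinuxfs/$knob")"
        else
            # Kernels without the namespace patches will not have these files.
            printf '%s=unavailable\n' "$knob"
        fi
    done
}
```

Running show_selinuxns_limits with no argument reads from /sys/fs/selinux. Writing new values is only expected to work from the init SELinux namespace, so the helper deliberately does not attempt it.
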