On Thu, Oct 3, 2024 at 1:04 PM Stephen Smalley <stephen.smalley.work@xxxxxxxxx> wrote:
> Based on our discussion at the last project meeting, I removed the
> requirement to unshare the network namespace when unsharing the
> SELinux namespace by adding a check in selnl_notify() to only send the
> SELinux netlink notifications to the init network namespace if the
> triggering process is in the init SELinux namespace. Hence, the
> creator of a child SELinux namespace can either choose to unshare the
> network namespace if they want to receive such netlink notifications
> (in which case they will be sent to that child network namespace
> only), or they can just use the SELinux status page exported by
> /sys/fs/selinux/status, which is the default in libselinux for kernels
> that support it.
>
> With that change, I can now run all of the selinux-testsuite tests
> successfully from a child SELinux namespace except for two labeled
> IPSEC tests each for inet_socket/tcp, inet_socket/udp, and
> inet_socket/mptcp. To fully pass the other tests, I had to also put
> the parent namespace into permissive mode to avoid certain failures
> due to MCS constraints in the base policy that can't be overridden via
> the test policy. The remaining labeled IPSEC test failures are likely
> due to the fact that the xfrm hooks are not passed a sock structure or
> anything else from which I can obtain the appropriate SELinux
> namespace to use so they are hardcoded to use the init SELinux
> namespace and even when it is permissive, there are hardcoded SID
> comparisons in those hooks that are likely failing.
>
> I also introduced configurable limits for the maximum number of
> SELinux namespaces and for the maximum depth to which they can be
> nested.
> The default values of each can be controlled via Kconfig
> options, which default to 65535 and 32 respectively (matching user
> namespaces), and can be further adjusted via /sys/fs/selinux/maxns and
> /sys/fs/selinux/maxnsdepth respectively but only from the init SELinux
> namespace (child namespaces can read but not modify them). A simple
> pair of test scripts to recursively create SELinux namespaces
> correctly failed when it hit the maxnsdepth and lowering the maxns
> value correctly prevented exceeding that number of total namespaces.
> These tests however exposed a couple of reference counting bugs in the
> code (one for SELinux namespaces, one for the parent cred that we
> cache in the task security blob for use in checks on the parent
> namespace), which are now also fixed.
>
> I have completed converting all of the permission checks to use the
> namespace-aware helpers or annotated them with comments indicating
> when it is correct to only check against the current SELinux
> namespace. For some of the checks, it is debatable as to which helper
> should be used, so we may need to revisit some of these based on
> experience.
>
> What remains to be done:
> 1. Maybe rework how policy capabilities are being checked/used to
> correctly support child namespaces with different policy capabilities
> from the parent. I can do this for some simple cases by lifting the
> logic to walk the namespaces up into the hook function itself and
> checking the policy capability value in each namespace, but many
> (most?) of the policy capabilities don't lend themselves to this
> approach. For example, extended_socket_class enables finer-grained
> socket security classes, but this is checked and applied when the
> socket security blob is initialized, not at permission check time.
> Unless we want to move the mapping logic to every permission check, I
> am not sure what can be done there.
> Similarly, a number of policy
> capabilities control labeling behaviors rather than permission checks,
> and since we are no longer trying to support per-namespace object
> SIDs/contexts, only one namespace's policy can be applied, and that
> label will then be used for all subsequent checks even in the other
> namespaces.
>
> 2. Decide if any further hardening of selinuxfs is required to safely
> permit usage by potentially untrusted / less trusted processes in
> child namespaces. There has already been a lot of work to harden e.g.
> the policy loading logic against ill-formed policies and such, so not
> sure if there is much to do here, but noting it. I would like to get
> rid of /sys/fs/selinux/user altogether so possibly making it
> inaccessible in child namespaces would be a good first step.
>
> 3. Optimize the implementation for the single SELinux namespace case,
> reducing and/or eliminating the overhead introduced by the SELinux
> namespace support for that common case. Lots of work to do here, help
> welcome. Also would appreciate guidance on current benchmarking
> practices since it has been a while since I've had to do that.
>
> 4. Revisit the userspace API for unsharing the SELinux namespace
> if/when the rest is ready. Currently just
> "echo 1 > /sys/fs/selinux/unshare" (followed by the other necessary
> steps for unsharing the mount namespace, unmounting the parent's
> selinuxfs, mounting a new selinuxfs for the child, loading a policy,
> and setting enforcing mode). Options would include adding a
> CLONE_SECURITY flag to unshare/clone that could be implemented by
> any/all LSMs via a call to a new (stacked) LSM hook function, or one
> or more new LSM system calls to do the same, or just keeping it the
> way it is via selinuxfs.
>
> Experimentation is welcome, particularly for more complex cases, e.g.
> where the host policy and the child policy differ (no policy loaded on
> host, policy in child; policy loaded on host, no policy in child; host
> policy from one distribution/release; child from another, etc). Be
> aware however that since the permission checks are applied to the
> current namespace and its ancestors, the parent namespace may deny
> something that would be allowed in the child, especially if the child
> is using contexts that are unknown to the parent's policy (which will
> be treated as unlabeled for those checks in the parent). Also be aware
> that since we are no longer trying to support per-namespace object
> SIDs/contexts, any object first instantiated in the parent namespace
> will be labeled according to its policy, not the child's policy.
>
> The tree can be found at:
> https://github.com/stephensmalley/selinux-kernel/tree/working-selinuxns
>
> It may be re-based or changed at any time.
> To experiment, after building and booting this kernel, do the following:
> # Create root shell
> sudo bash
> # Unshare SELinux namespace
> echo 1 > /sys/fs/selinux/unshare
> id # Context is now "init" or "kernel" in child; ps -eZ from parent will still show original context
> # Unshare mount namespace and mount new selinuxfs for child SELinux namespace
> unshare -m
> umount /sys/fs/selinux
> mount -t selinuxfs none /sys/fs/selinux
> # Load a policy into the child SELinux namespace, parent unaffected
> load_policy
> id # Context is now kernel_generic_helper_t on Fedora due to a default transition in its policy
> # Switch to a suitable security context before trying to go enforcing
> runcon unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 /bin/bash
> # Switch child to enforcing, checking that you didn't get killed once enforcing
> echo $$
> setenforce 1
> echo $$
> # Do stuff in child, run testsuite (switch parent to permissive first to avoid denials from it), etc.
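
Before running the steps above, it may help to confirm that the booted kernel actually has the namespace support and to see what limits are in effect. The following is only a minimal sketch: the node names (unshare, maxns, maxnsdepth) are the ones mentioned in this thread, while the function name selinuxns_preflight and its messages are made up for illustration and are not part of the tree.

```shell
#!/bin/sh
# Pre-flight check before trying the SELinux namespace steps.
# Assumption: node names come from this thread; the function name and
# output format are hypothetical.
selinuxns_preflight() {
    fsdir="${1:-/sys/fs/selinux}"

    # Without the unshare node, the running kernel lacks the namespace support
    if [ ! -e "$fsdir/unshare" ]; then
        echo "no unshare node under $fsdir"
        return 1
    fi

    # Report the current limits (readable from child namespaces,
    # writable only from the init SELinux namespace)
    for node in maxns maxnsdepth; do
        if [ -r "$fsdir/$node" ]; then
            echo "$node=$(cat "$fsdir/$node")"
        else
            echo "$node=unavailable"
        fi
    done
}

# Usage: selinuxns_preflight            (checks /sys/fs/selinux)
#        selinuxns_preflight /some/dir  (checks a different mount point)
```

The mount point is a parameter only so that the check can be pointed at a different selinuxfs mount; it does not write to any node.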
Oops, I see that the selinux tree was rebased onto 6.12-rc1, so I am now updating my branch to match. There are conflicts, so it may take a little while.