Re: SELinux namespaces re-base

Stephen Smalley <stephen.smalley.work@xxxxxxxxx> · Thu, 3 Oct 2024 14:29:42 -0400

On Thu, Oct 3, 2024 at 1:04 PM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
> Based on our discussion at the last project meeting, I removed the
> requirement to unshare the network namespace when unsharing the
> SELinux namespace by adding a check in selnl_notify() to only send the
> SELinux netlink notifications to the init network namespace if the
> triggering process is in the init SELinux namespace. Hence, the
> creator of a child SELinux namespace can either choose to unshare the
> network namespace if they want to receive such netlink notifications
> (in which case they will be sent to that child network namespace
> only), or they can just use the SELinux status page exported by
> /sys/fs/selinux/status, which is the default in libselinux for kernels
> that support it.
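
For anyone who wants to look at the status page directly rather than through libselinux, a rough sketch follows. It assumes the page starts with the five native-endian 32-bit fields of struct selinux_kernel_status (version, sequence, enforcing, policyload, deny_unknown) and that a plain read() of the file is acceptable in place of mmap(). The function name and the optional path argument are made up for illustration, and there is no retry loop on the sequence number as a real consumer would need.

```shell
#!/bin/sh
# Sketch: dump the fields of the SELinux kernel status page.
# Assumes five native-endian u32 fields, in this order:
# version, sequence, enforcing, policyload, deny_unknown.
# selinux_status_dump is a made-up name, not a libselinux tool.
selinux_status_dump() {
    status_file=${1:-/sys/fs/selinux/status}
    # od prints the first 20 bytes as five unsigned 32-bit decimals;
    # set -- splits them into the positional parameters $1..$5.
    # shellcheck disable=SC2046
    set -- $(od -An -tu4 -N20 "$status_file")
    [ "$#" -eq 5 ] || { echo "unexpected status page size" >&2; return 1; }
    printf 'version=%s sequence=%s enforcing=%s policyload=%s deny_unknown=%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

Calling selinux_status_dump with no argument reads /sys/fs/selinux/status. An odd sequence value would indicate that the kernel was in the middle of an update, in which case the read should be repeated.
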
>
> With that change, I can now run all of the selinux-testsuite tests
> successfully from a child SELinux namespace except for two labeled
> IPSEC tests each for inet_socket/tcp, inet_socket/udp, and
> inet_socket/mptcp. To fully pass the other tests, I had to also put
> the parent namespace into permissive mode to avoid certain failures
> due to MCS constraints in the base policy that can't be overridden via
> the test policy. The remaining labeled IPSEC test failures are likely
> due to the fact that the xfrm hooks are not passed a sock structure or
> anything else from which I can obtain the appropriate SELinux
> namespace to use, so they are hardcoded to use the init SELinux
> namespace. Even when that namespace is permissive, there are hardcoded
> SID comparisons in those hooks that are likely failing.
>
> I also introduced configurable limits for the maximum number of
> SELinux namespaces and for the maximum depth to which they can be
> nested. The default values of each can be controlled via Kconfig
> options, which default to 65535 and 32 respectively (matching user
> namespaces), and can be further adjusted via /sys/fs/selinux/maxns and
> /sys/fs/selinux/maxnsdepth respectively but only from the init SELinux
> namespace (child namespaces can read but not modify them). A simple
> pair of test scripts to recursively create SELinux namespaces
> correctly failed when it hit the maxnsdepth and lowering the maxns
> value correctly prevented exceeding that number of total namespaces.
> These tests however exposed a couple of reference counting bugs in the
> code (one for SELinux namespaces, one for the parent cred that we
> cache in the task security blob for use in checks on the parent
> namespace), which are now also fixed.
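
A small sketch for inspecting the two limits from a shell on this tree. The file names maxns and maxnsdepth are the ones given above and only exist in this experimental branch; the helper name and its optional directory argument are hypothetical and only there to make the snippet easy to try against a scratch directory.

```shell
#!/bin/sh
# Sketch: print the SELinux namespace limits exposed by this
# experimental tree. The maxns and maxnsdepth files do not exist on
# mainline kernels; show_selinuxns_limits is a made-up helper name.
show_selinuxns_limits() {
    selinuxfs=${1:-/sys/fs/selinux}
    for name in maxns maxnsdepth; do
        if [ -r "$selinuxfs/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$selinuxfs/$name")"
        else
            printf '%s: not available\n' "$name" >&2
            return 1
        fi
    done
}
```

Reading should work from any SELinux namespace; per the description above, modifying the values is only possible from the init SELinux namespace.
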
>
> I have completed converting all of the permission checks to use the
> namespace-aware helpers or annotated them with comments indicating
> when it is correct to only check against the current SELinux
> namespace. For some of the checks, it is debatable as to which helper
> should be used, so we may need to revisit some of these based on
> experience.
>
> What remains to be done:
> 1. Maybe rework how policy capabilities are being checked/used to
> correctly support child namespaces with different policy capabilities
> from the parent. I can do this for some simple cases by lifting the
> logic to walk the namespaces up into the hook function itself and
> checking the policy capability value in each namespace, but many
> (most?) of the policy capabilities don't lend themselves to this
> approach. For example, extended_socket_class enables finer-grained
> socket security classes, but this is checked and applied when the
> socket security blob is initialized, not at permission check time.
> Unless we want to move the mapping logic to every permission check, I
> am not sure what can be done there. Similarly, a number of policy
> capabilities control labeling behaviors rather than permission checks,
> and since we are no longer trying to support per-namespace object
> SIDs/contexts, only one namespace's policy can be applied, and the
> resulting label will then be used for all subsequent checks, even in
> the other namespaces.
>
> 2. Decide if any further hardening of selinuxfs is required to safely
> permit usage by potentially untrusted / less trusted processes in
> child namespaces. There has already been a lot of work to harden e.g.
> the policy loading logic against ill-formed policies and such, so not
> sure if there is much to do here, but noting it. I would like to get
> rid of /sys/fs/selinux/user altogether so possibly making it
> inaccessible in child namespaces would be a good first step.
>
> 3. Optimize the implementation for the single SELinux namespace case,
> reducing and/or eliminating the overhead introduced by the SELinux
> namespace support for that common case. Lots of work to do here, help
> welcome. Also would appreciate guidance on current benchmarking
> practices since it has been a while since I've had to do that.
>
> 4. Revisit the userspace API for unsharing the SELinux namespace
> if/when the rest is ready. Currently just "echo 1 >
> /sys/fs/selinux/unshare" (followed by the other necessary steps for
> unsharing the mount namespace, unmounting the parent's selinuxfs,
> mounting a new selinuxfs for the child, loading a policy, and setting
> enforcing mode). Options would include adding a CLONE_SECURITY flag to
> unshare/clone that could be implemented by any/all LSMs via a call to
> a new (stacked) LSM hook function, or one or more new LSM system calls
> to do the same, or just keeping it the way it is via selinuxfs.
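
Regarding item 1, a simple way to see where a parent and child actually differ is to dump the policy capabilities from each namespace's selinuxfs and diff the results. The sketch below assumes the usual policy_capabilities directory under selinuxfs, with one file per capability containing 0 or 1; the helper name and the optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: list policy capabilities as name=value pairs so the output
# from a parent and a child SELinux namespace can be compared with diff.
# list_polcaps is a made-up helper name.
list_polcaps() {
    capdir=${1:-/sys/fs/selinux}/policy_capabilities
    [ -d "$capdir" ] || { echo "no policy_capabilities directory" >&2; return 1; }
    for f in "$capdir"/*; do
        [ -f "$f" ] || continue
        # ${f##*/} strips the directory part, leaving the capability name
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

Running list_polcaps once in the parent and once in the child (after the child has mounted its own selinuxfs and loaded a policy), each redirected to a file, gives two lists that diff can compare directly.
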
>
> Experimentation is welcome, particularly for more complex cases, e.g.
> where the host policy and the child policy differ (no policy loaded on
> host, policy in child; policy loaded on host, no policy in child; host
> policy from one distribution/release, child from another; etc.). Be
> aware however that since the permission checks are applied to the
> current namespace and its ancestors, the parent namespace may deny
> something that would be allowed in the child, especially if the child
> is using contexts that are unknown to the parent's policy (which will
> be treated as unlabeled for those checks in the parent). Also be aware
> that since we are no longer trying to support per-namespace object
> SIDs/contexts, any object first instantiated in the parent namespace
> will be labeled according to its policy, not the child's policy.
>
> The tree can be found at:
> https://github.com/stephensmalley/selinux-kernel/tree/working-selinuxns
>
> It may be re-based or changed at any time.
> To experiment, after building and booting this kernel, do the following:
> # Create root shell
> sudo bash
> # Unshare SELinux namespace
> echo 1 > /sys/fs/selinux/unshare
> id # Context is now "init" or "kernel" in child; ps -eZ from parent
> #    will still show original context
> # Unshare mount namespace and mount new selinuxfs for child SELinux namespace
> unshare -m
> umount /sys/fs/selinux
> mount -t selinuxfs none /sys/fs/selinux
> # Load a policy into the child SELinux namespace, parent unaffected
> load_policy
> id # Context is now kernel_generic_helper_t on Fedora due to a default
> #    transition in its policy
> # Switch to a suitable security context before trying to go enforcing
> runcon unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 /bin/bash
> # Switch child to enforcing, checking that you didn't get killed once enforcing
> echo $$
> setenforce 1
> echo $$
> # Do stuff in child, run testsuite (switch parent to permissive first
> # to avoid denials from it), etc.

Oops, I see that the selinux tree has been re-based onto 6.12-rc1, so I
am now updating my branch to match.
There are conflicts, so it may take a little while.