Re: SELinux namespaces re-base

Stephen Smalley <stephen.smalley.work@xxxxxxxxx> · Mon, 7 Oct 2024 16:01:46 -0400

On Thu, Oct 3, 2024 at 4:11 PM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> On Thu, Oct 3, 2024 at 2:29 PM Stephen Smalley
> <stephen.smalley.work@xxxxxxxxx> wrote:
> >
> > On Thu, Oct 3, 2024 at 1:04 PM Stephen Smalley
> > <stephen.smalley.work@xxxxxxxxx> wrote:
> > > Based on our discussion at the last project meeting, I removed the
> > > requirement to unshare the network namespace when unsharing the
> > > SELinux namespace by adding a check in selnl_notify() to only send the
> > > SELinux netlink notifications to the init network namespace if the
> > > triggering process is in the init SELinux namespace. Hence, the
> > > creator of a child SELinux namespace can either choose to unshare the
> > > network namespace if they want to receive such netlink notifications
> > > (in which case they will be sent to that child network namespace
> > > only), or they can just use the SELinux status page exported by
> > > /sys/fs/selinux/status, which is the default in libselinux for kernels
> > > that support it.
> > >
> > > With that change, I can now run all of the selinux-testsuite tests
> > > successfully from a child SELinux namespace except for two labeled
> > > IPSEC tests each for inet_socket/tcp, inet_socket/udp, and
> > > inet_socket/mptcp. To fully pass the other tests, I had to also put
> > > the parent namespace into permissive mode to avoid certain failures
> > > due to MCS constraints in the base policy that can't be overridden via
> > > the test policy. The remaining labeled IPSEC test failures are likely
> > > due to the fact that the xfrm hooks are not passed a sock structure or
> > > anything else from which I can obtain the appropriate SELinux
> > > namespace to use so they are hardcoded to use the init SELinux
> > > namespace and even when it is permissive, there are hardcoded SID
> > > comparisons in those hooks that are likely failing.
> > >
> > > I also introduced configurable limits for the maximum number of
> > > SELinux namespaces and for the maximum depth to which they can be
> > > nested. The default values of each can be controlled via Kconfig
> > > options, which default to 65535 and 32 respectively (matching user
> > > namespaces), and can be further adjusted via /sys/fs/selinux/maxns and
> > > /sys/fs/selinux/maxnsdepth respectively but only from the init SELinux
> > > namespace (child namespaces can read but not modify them). A simple
> > > pair of test scripts to recursively create SELinux namespaces
> > > correctly failed when it hit the maxnsdepth and lowering the maxns
> > > value correctly prevented exceeding that number of total namespaces.
> > > These tests however exposed a couple of reference counting bugs in the
> > > code (one for SELinux namespaces, one for the parent cred that we
> > > cache in the task security blob for use in checks on the parent
> > > namespace), which are now also fixed.
> > >
> > > I have completed converting all of the permission checks to use the
> > > namespace-aware helpers or annotated them with comments indicating
> > > when it is correct to only check against the current SELinux
> > > namespace. For some of the checks, it is debatable as to which helper
> > > should be used, so we may need to revisit some of these based on
> > > experience.
> > >
> > > What remains to be done:
> > > 1. Maybe rework how policy capabilities are being checked/used to
> > > correctly support child namespaces with different policy capabilities
> > > from the parent. I can do this for some simple cases by lifting the
> > > logic to walk the namespaces up into the hook function itself and
> > > checking the policy capability value in each namespace, but many
> > > (most?) of the policy capabilities don't lend themselves to this
> > > approach. For example, extended_socket_class enables finer-grained
> > > socket security classes, but this is checked and applied when the
> > > socket security blob is initialized, not at permission check time.
> > > Unless we want to move the mapping logic to every permission check, I
> > > am not sure what can be done there. Similarly, a number of policy
> > > capabilities control labeling behaviors rather than permission checks,
> > > and since we are no longer trying to support per-namespace object
> > > SIDs/contexts, only one namespace's policy can be applied, and that label
> > > will then be used for all subsequent checks even in the other
> > > namespaces.
> > >
> > > 2. Decide if any further hardening of selinuxfs is required to safely
> > > permit usage by potentially untrusted / less trusted processes in
> > > child namespaces. There has already been a lot of work to harden e.g.
> > > the policy loading logic against ill-formed policies and such, so not
> > > sure if there is much to do here, but noting it. I would like to get
> > > rid of /sys/fs/selinux/user altogether so possibly making it
> > > inaccessible in child namespaces would be a good first step.
> > >
> > > 3. Optimize the implementation for the single SELinux namespace case,
> > > reducing and/or eliminating the overhead introduced by the SELinux
> > > namespace support for that common case. Lots of work to do here, help
> > > welcome. Also would appreciate guidance on current benchmarking
> > > practices since it has been a while since I've had to do that.
> > >
> > > 4. Revisit the userspace API for unsharing the SELinux namespace
> > > if/when the rest is ready. Currently just "echo 1 >
> > > /sys/fs/selinux/unshare" (followed by the other necessary steps for
> > > unsharing the mount namespace, unmounting the parent's selinuxfs,
> > > mounting a new selinuxfs for the child, loading a policy, and setting
> > > enforcing mode). Options would include adding a CLONE_SECURITY flag to
> > > unshare/clone that could be implemented by any/all LSMs via a call to
> > > a new (stacked) LSM hook function, or one or more new LSM system calls
> > > to do the same, or just keeping it the way it is via selinuxfs.
> > >
> > > Experimentation is welcome, particularly for more complex cases, e.g.
> > > where the host policy and the child policy differ (no policy loaded on
> > > host, policy in child; policy loaded on host, no policy in child; host
> > > policy from one distribution/release, child from another; etc.). Be
> > > aware however that since the permission checks are applied to the
> > > current namespace and its ancestors, the parent namespace may deny
> > > something that would be allowed in the child, especially if the child
> > > is using contexts that are unknown to the parent's policy (which will
> > > be treated as unlabeled for those checks in the parent). Also be aware
> > > that since we are no longer trying to support per-namespace object
> > > SIDs/contexts, any object first instantiated in the parent namespace
> > > will be labeled according to its policy, not the child's policy.
> > >
> > > The tree can be found at:
> > > https://github.com/stephensmalley/selinux-kernel/tree/working-selinuxns
> > >
> > > It may be re-based or changed at any time.
> > > To experiment, after building and booting this kernel, do the following:
> > > # Create root shell
> > > sudo bash
> > > # Unshare SELinux namespace
> > > echo 1 > /sys/fs/selinux/unshare
> > > id # Context is now "init" or "kernel" in child; ps -eZ from parent
> > > # will still show original context
> > > # Unshare mount namespace and mount new selinuxfs for child SELinux namespace
> > > unshare -m
> > > umount /sys/fs/selinux
> > > mount -t selinuxfs none /sys/fs/selinux
> > > # Load a policy into the child SELinux namespace, parent unaffected
> > > load_policy
> > > id # Context is now kernel_generic_helper_t on Fedora due to a default
> > > # transition in its policy
> > > # Switch to a suitable security context before trying to go enforcing
> > > runcon unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 /bin/bash
> > > # Switch child to enforcing, checking that you didn't get killed once enforcing
> > > echo $$
> > > setenforce 1
> > > echo $$
> > > # Do stuff in child, run testsuite (switch parent to permissive first
> > > # to avoid denials from it), etc.
> >
> > Oops, I see that the selinux tree re-based to 6.12-rc1, so now
> > updating my branch to that.
> > There are conflicts so it may take a little bit.
>
> Wasn't too bad. Now re-based on 6.12-rc1.

Re-based again on latest selinux/dev, and pushed a few more changes:
- selinux: make open_perms namespace-aware; demonstrates how to
integrate the namespace-based checking with the open_perms policy
capability so that we only check file open permission in namespaces
that enable the capability in their policy. That was the easy case;
the rest of the policy capabilities are less clear on how to resolve
as per my earlier description.
- selinux: split cred_ssid_has_perm() into two cases; alters how the
namespace-based checking is applied to socket and SysV IPC checks
based on some testing
- selinux: set initial SID context for init to "kernel" in global SID
table; fixes what would be a userspace compatibility problem for init
in child namespaces by duplicating what we were already doing in the
security server's per-policy SID table.
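
To make the open_perms rule concrete, here is a toy userspace model of the
walk, not the kernel code. It assumes each namespace on the path from the
current one up to init is described by a made-up "name:cap:allowed" record;
the function name and record layout are hypothetical and exist only for
illustration.

```shell
# Toy model only. Each argument describes one SELinux namespace, ordered from
# the current namespace up to the init namespace:
#   "<name>:<open_perms enabled, 0|1>:<policy allows open, 0|1>"
# The record layout and function name are hypothetical.
check_open_in_ancestors() {
    for ns_entry in "$@"; do
        ns_name=${ns_entry%%:*}
        ns_rest=${ns_entry#*:}
        ns_cap=${ns_rest%%:*}
        ns_allowed=${ns_rest#*:}
        # A namespace whose policy does not enable open_perms performs no
        # open check at all, so it can neither allow nor deny here.
        if [ "$ns_cap" != "1" ]; then
            continue
        fi
        # Any namespace that does check open and denies it fails the access.
        if [ "$ns_allowed" != "1" ]; then
            echo "denied in $ns_name"
            return 1
        fi
    done
    echo "allowed"
    return 0
}
```

For example, `check_open_in_ancestors child:1:1 init:0:0` is allowed because
init does not enable the capability and is skipped, while
`check_open_in_ancestors child:1:1 init:1:0` is denied by init.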

In addition to running the testsuite in a child SELinux namespace as
before, I also launched a RHEL9 UBI container, manually unshared the
SELinux namespace in a shell within it, and loaded the RHEL9 policy
into the child namespace to simulate what we would ultimately want the
container runtime to do. That motivated the latter two changes above.
Not quite ready to put that into enforcing mode due to some denials
between container processes and host resources (like the inherited
open fds) but a step forward.
I think to better understand what else is needed, it would help to
prototype modifications to a container runtime to unshare the SELinux
namespace (just using the current /sys/fs/selinux/unshare interface
for now) before launching the container init, and see if the container
init does the right thing (i.e. concludes that SELinux doesn't yet
have a policy loaded based on /proc/self/attr/current=="kernel",
mounts its own /sys/fs/selinux private to its namespace, and loads its
policy into it). Thoughts on what the easiest container runtime to
patch in this way would be, or anyone want to try?
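
As a rough sketch of the check a container init would need to make, assuming
the context is read from /proc/self/attr/current and may carry a trailing NUL
or newline (the helper name and the optional path argument are hypothetical,
added only so it can be tried against a test file):

```shell
# Returns 0 if the context in the attr file is exactly "kernel", i.e. no
# policy appears to be loaded in this SELinux namespace yet.
# $1: attr file to read (hypothetical parameter; defaults to the real path).
selinux_policy_not_loaded() {
    attr_file=${1:-/proc/self/attr/current}
    # Drop any trailing NUL or newline before comparing.
    ctx=$(2>/dev/null tr -d '\000\n' < "$attr_file") || return 2
    [ "$ctx" = "kernel" ]
}

# Intended use in a container init, reusing the manual steps from earlier in
# the thread (left as comments so nothing is mounted by accident):
#   if selinux_policy_not_loaded; then
#       mount -t selinuxfs none /sys/fs/selinux
#       load_policy
#   fi
```

A real init would also need to unshare the mount namespace and unmount the
parent's selinuxfs first, as in the manual steps.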