Re: SELinux namespaces re-base

Stephen Smalley <stephen.smalley.work@xxxxxxxxx> · Wed, 9 Oct 2024 13:57:50 -0400

On Wed, Oct 9, 2024 at 9:09 AM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> On Tue, Oct 8, 2024 at 9:32 AM Stephen Smalley
> <stephen.smalley.work@xxxxxxxxx> wrote:
> > Re-based again on top of latest selinux/dev to resolve the conflicts
> > with the just-merged patches and to update the new netlink xperm
> > support for SELinux namespaces. Passes the selinux-testsuite including
> > the (not yet merged) nlmsg tests in both the init SELinux namespace
> > and a child SELinux namespace (modulo the labeled IPSEC tests and with
> > the init SELinux namespace permissive for testing the child or
> > modifying the init namespace policy to permit it to run all the tests
> > in the child context). Functionally, this is nearly complete as far as
> > SELinux-only changes go (not including the corresponding work needed
> > to namespace audit and if desired/necessary, to allow namespacing of
> > the labeled IPSEC hooks), modulo any bugs that get discovered in
> > trying to create real containers with their own SELinux namespaces and
> > different combinations of policies between the host OS and the
> > containers.
> >
> > My remaining ToDo list is as follows, but this is a good point for
> > others to provide feedback or experiment with the functionality or
> > propose their favorite container runtime for the next stages of
> > prototyping. If it would help spark feedback, I could post the current
> > set of kernel patches to the list.
> >
> > - Test creation/use of SELinux namespaces from actual containers with
> > different policies from the host OS. This may require patching a
> > container runtime to add support for unsharing the SELinux namespace
> > and unmounting the old selinuxfs prior to starting the container init.
> > Combinations to test: No policy loaded on host, policy loaded in
> > container e.g. Fedora on Ubuntu; host with newer base policy than
> > container e.g. RHEL/Rocky 8/9 on Fedora; container with newer base
> > policy than host e.g. Fedora on RHEL/Rocky 8/9; host and container
> > with different upstream policy sources e.g. Ubuntu on Fedora; Android
> > container on Linux host OS.
>
> To help get this started, I created a patch for libselinux to provide
> a selinux_unshare() API that unshares the SELinux namespace (hiding
> the current messy internal details of the existing kernel interface
> and also dealing with various situations under which it might be
> called by container runtimes with selinuxfs already mounted, bind
> mounted read-only, or not mounted at all) along with a sample
> unshareselinux utility that shows how to use it, and a patch for
> systemd-nspawn to show how it might be called from a container runtime
> to unshare the SELinux namespace during container creation. These can
> be found in the selinuxns branches of my selinux userspace and systemd
> forks at:
> https://github.com/stephensmalley/selinux/tree/selinuxns
> and
> https://github.com/stephensmalley/systemd/tree/selinuxns
> respectively.
>
> While the patches appear to work correctly (i.e. you end up with a new
> SELinux namespace, after which you can mount a new selinuxfs that is
> private to your namespace, load a policy, set enforcing mode, etc.),
> unfortunately it appears that systemd doesn't just do the Right Thing
> automatically when it is invoked as a container init after unsharing
> the SELinux namespace, i.e. it does not proceed to call the SELinux
> setup functionality so it never tries to mount selinuxfs and load a
> policy within the container. This is unsurprising, but it does mean that
> someone will need to modify it to do so in this case while ensuring
> that this doesn't break existing setups without the SELinux namespace
> functionality.

Pushed up a further commit to the branch on my fork of systemd to have
it call the SELinux setup + init functions if invoked from
systemd-nspawn with the SELinux namespace unshared. The existing
systemd was skipping setup/init of all of the MAC modules if running
in a container, which was understandable absent namespace support. My
current patch (just to allow further progress) only relaxes that
constraint for SELinux and only if launched via systemd-nspawn with
the --selinux-namespace option; this would of course be generalized
further if/when we get around to upstreaming it. With that change and
installing the modified systemd into the container root filesystem, I
can start a container via systemd-nspawn with the --selinux-namespace
option and have it unshare the SELinux namespace, load policy from the
container's root, and set its enforcing mode. At present, if the
container is configured to be enforcing, the container will fail due
to denials in the child SELinux namespace arising from the following:
- systemd creates a regular tmpfs mount for the container /dev, so at
least some of the /dev nodes are not correctly labeled at startup.
This can likely be fixed through some combination of policy and
perhaps performing a restorecon("/dev") after first loading policy.
- Certain /proc/sys files in the container are labeled with
"unlabeled_t" for some reason, likely due to being accessed in the
namespace before it loads a policy and not getting initialized
afterward. This could similarly be fixed via a restorecon("/proc")
after policy load if we can't solve it kernel-side.
- sendto permission denied from kernel_t and from init_t to
unconfined_t:unix_dgram_socket; this is likely the container sending
to a socket in the parent namespace.

There are no doubt more beyond these. However, in permissive mode
(with the parent/init namespace still enforcing), the container did
come up fully and sees SELinux as enabled.

> > - Rework how policy capabilities are being checked/used to correctly
> > support child namespaces with different policy capabilities from the
> > parent. This has already been done for the open_perms capability by
> > lifting the logic to walk the namespaces up into the hook function
> > itself and checking the policy capability value in each namespace, but
> > many (most?) of the policy capabilities don't lend themselves to this
> > approach. For example, extended_socket_class enables finer-grained
> > socket security classes, but this is checked and applied when the
> > socket security blob is initialized, not at permission check time.
> > Unless we want to move the mapping logic to every permission check, I
> > am not sure what can be done there. One option would be to
> > force-enable the same policy capabilities in the child namespace as in
> > the parent to avoid conflicts but this would limit the ability to use
> > differing policies. Similarly, a number of policy capabilities control
> > labeling behaviors rather than permission checks, and since we are no
> > longer trying to support per-namespace object SIDs/contexts, only one
> > namespace's policy can be applied, and that label will then be used
> > for all subsequent checks even in the other namespaces.
> >
> > - Decide if any further hardening of selinuxfs is required to safely
> > permit usage by potentially untrusted / less trusted processes in
> > child namespaces. There has already been a lot of work to harden e.g.
> > the policy loading logic against ill-formed policies and such, so not
> > sure if there is much to do here, but noting it as a possible area to
> > audit for safety.
> >
> > - Ensure that we are correctly handling peer and packet labels when
> > they cross SELinux namespaces, for some definition thereof, both wrt
> > permission checking and wrt the peer/packet labels that are exposed to
> > userspace via getsockopt(SO_PEERSEC), recvmsg() SCM_SECURITY, etc.
> >
> > - Optimize the implementation for the single SELinux namespace case,
> > reducing and/or eliminating the overhead introduced by the SELinux
> > namespace support for that common case. Lots of work to do here, help
> > welcome. Also would appreciate guidance on current Linux kernel
> > benchmarking best practices since it has been a while since I've had
> > to do that.
> >
> > - Re-test with KASAN and with KCSAN enabled to confirm that the
> > namespace patches haven't introduced any memory errors or race
> > conditions; I have tested with each of these in the past successfully
> > but don't keep them enabled generally because they make everything
> > very slow. And you can't have them both enabled together at runtime
> > AFAICT.
> >
> > - Revisit the userspace API for unsharing the SELinux namespace
> > if/when the rest is ready. Currently just "echo 1 >
> > /sys/fs/selinux/unshare" (followed by the other necessary steps for
> > unsharing the mount namespace, unmounting the parent's selinuxfs,
> > mounting a new selinuxfs for the child, loading a policy, and setting
> > enforcing mode). Options would include adding a CLONE_SECURITY flag to
> > unshare/clone that could be implemented by any/all LSMs via a call to
> > a new (stacked) LSM hook function, or one or more new LSM system calls
> > to do the same, or just keeping it the way it is via selinuxfs.
> >
> > - Upstream the kernel support.
> >
> > - Figure out how to combine the use of SELinux namespaces with Red
> > Hat's current model of isolating and confining containers as a whole
> > via SELinux on the host OS. This is complicated by the fact that we
> > are only supporting a single inode SID/context per inode (gave up on
> > per-namespace inode SID/contexts, see the earlier mailing list
> > discussions), and Red Hat's current model uses context mounts to
> > assign a single security context to all the inodes used by the
> > container. Possibly introduce a new kind of context= mount that is
> > namespace-aware, i.e. only apply the context mount when in the outer
> > namespace but use the inode xattrs inside the child namespace.
> >
> > - Integrate and upstream userspace support into appropriate container
> > runtimes, e.g. systemd-nspawn, crun/runc, podman, docker, k8s, etc.
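
For anyone who wants to experiment without the libselinux or
systemd-nspawn patches, the manual sequence from the userspace API item
above (write to /sys/fs/selinux/unshare, unshare the mount namespace,
unmount the parent's selinuxfs, mount a new one, load a policy, set
enforcing mode) can be sketched roughly as the shell function below.
This is only an illustration under some assumptions: it requires a
kernel built with the prototype namespace patches and root privileges,
the function name selinuxns_run and the DRY_RUN switch are made up for
this example, and the exact ordering of the first two steps is my
reading of the description rather than something the interface is
known to require.

```shell
#!/bin/sh
# Sketch only: /sys/fs/selinux/unshare exists only on kernels carrying
# the prototype SELinux namespace patches; it is not in mainline.

# Steps run inside the new mount namespace. "$@" is the program to start
# once the child SELinux namespace has its own selinuxfs and policy.
SELINUXNS_INNER='umount /sys/fs/selinux &&
mount -t selinuxfs none /sys/fs/selinux &&
load_policy &&
setenforce 1 &&
exec "$@"'

# selinuxns_run CMD [ARGS...]   (hypothetical helper name)
# The body is a subshell, so the caller's own SELinux namespace and
# mount namespace are left untouched.
# With DRY_RUN=1 the steps are printed instead of executed.
selinuxns_run() (
    if [ "$#" -lt 1 ]; then
        echo "usage: selinuxns_run CMD [ARGS...]" >&2
        exit 2
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' \
            "echo 1 > /sys/fs/selinux/unshare" \
            "unshare -m sh -c, running:" \
            "$SELINUXNS_INNER"
        exit 0
    fi
    # Must be the shell's builtin echo so that this process (and hence
    # its children) is the one moved into the new SELinux namespace.
    echo 1 > /sys/fs/selinux/unshare || exit 1
    # New mount namespace, then replace selinuxfs and load policy there.
    exec unshare -m sh -c "$SELINUXNS_INNER" sh "$@"
)
```

For example, "DRY_RUN=1 selinuxns_run /bin/sh" prints the steps without
touching anything, while "selinuxns_run /bin/sh" as root on a patched
kernel should leave you in a shell inside the child namespace.
load_policy here loads whatever policy the current root filesystem is
configured for, so testing a different policy would mean running this
from within the container's root.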