Re: SELinux namespaces re-base

Stephen Smalley <stephen.smalley.work@xxxxxxxxx> · Fri, 11 Oct 2024 09:51:25 -0400

On Thu, Oct 10, 2024 at 10:30 AM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> On Wed, Oct 9, 2024 at 3:25 PM Stephen Smalley
> <stephen.smalley.work@xxxxxxxxx> wrote:
> >
> > On Wed, Oct 9, 2024 at 1:57 PM Stephen Smalley
> > <stephen.smalley.work@xxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 9, 2024 at 9:09 AM Stephen Smalley
> > > <stephen.smalley.work@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Oct 8, 2024 at 9:32 AM Stephen Smalley
> > > > <stephen.smalley.work@xxxxxxxxx> wrote:
> > > > > Re-based again on top of latest selinux/dev to resolve the conflicts
> > > > > with the just-merged patches and to update the new netlink xperm
> > > > > support for SELinux namespaces. Passes the selinux-testsuite including
> > > > > the (not yet merged) nlmsg tests in both the init SELinux namespace
> > > > > and a child SELinux namespace (modulo the labeled IPSEC tests and with
> > > > > the init SELinux namespace permissive for testing the child or
> > > > > modifying the init namespace policy to permit it to run all the tests
> > > > > in the child context). Functionally, this is nearly complete as far as
> > > > > SELinux-only changes go (not including the corresponding work needed
> > > > > to namespace audit and if desired/necessary, to allow namespacing of
> > > > > the labeled IPSEC hooks), modulo any bugs that get discovered in
> > > > > trying to create real containers with their own SELinux namespaces and
> > > > > different combinations of policies between the host OS and the
> > > > > containers.
> > > > >
> > > > > My remaining ToDo list is as follows, but this is a good point for
> > > > > others to provide feedback or experiment with the functionality or
> > > > > propose their favorite container runtime for the next stages of
> > > > > prototyping. If it would help spark feedback, I could post the current
> > > > > set of kernel patches to the list.
> > > > >
> > > > > - Test creation/use of SELinux namespaces from actual containers with
> > > > > different policies from the host OS. This may require patching a
> > > > > container runtime to add support for unsharing the SELinux namespace
> > > > > and unmounting the old selinuxfs prior to starting the container init.
> > > > > Combinations to test: No policy loaded on host, policy loaded in
> > > > > container e.g. Fedora on Ubuntu; host with newer base policy than
> > > > > container e.g. RHEL/Rocky 8/9 on Fedora; container with newer base
> > > > > policy than host e.g. Fedora on RHEL/Rocky 8/9; host and container
> > > > > with different upstream policy sources e.g. Ubuntu on Fedora; Android
> > > > > container on Linux host OS.
> > > >
> > > > To help get this started, I created a patch for libselinux to provide
> > > > a selinux_unshare() API that unshares the SELinux namespace (hiding
> > > > the current messy internal details of the existing kernel interface
> > > > and also dealing with various situations under which it might be
> > > > called by container runtimes with selinuxfs already mounted, bind
> > > > mounted read-only, or not mounted at all) along with a sample
> > > > unshareselinux utility that shows how to use it, and a patch for
> > > > systemd-nspawn to show how it might be called from a container runtime
> > > > to unshare the SELinux namespace during container creation. These can
> > > > be found in the selinuxns branches of my selinux userspace and systemd
> > > > forks at:
> > > > https://github.com/stephensmalley/selinux/tree/selinuxns
> > > > and
> > > > https://github.com/stephensmalley/systemd/tree/selinuxns
> > > > respectively.
> > > >
> > > > While the patches appear to work correctly (i.e. you end up with a new
> > > > SELinux namespace, after which you can mount a new selinuxfs that is
> > > > private to your namespace, load a policy, set enforcing mode, etc),
> > > > unfortunately it appears that systemd doesn't just do the Right Thing
> > > > automatically when it is invoked as a container init after unsharing
> > > > the SELinux namespace, i.e. it does not proceed to call the SELinux
> > > > setup functionality so it never tries to mount selinuxfs and load a
> > > > policy within the container. Unsurprising but it does mean that
> > > > someone will need to modify it to do so in this case while ensuring
> > > > that this doesn't break existing setups without the SELinux namespace
> > > > functionality.
> > >
> > > Pushed up a further commit to the branch on my fork of systemd to have
> > > it call the SELinux setup + init functions if invoked from
> > > systemd-nspawn with the SELinux namespace unshared. The existing
> > > systemd was skipping setup/init of all of the MAC modules if running
> > > in a container, which was understandable absent namespace support. My
> > > current patch (just to allow further progress) only relaxes that
> > > constraint for SELinux and only if launched via systemd-nspawn with
> > > the --selinux-namespace option; this would of course be generalized
> > > further if/when we get around to upstreaming it. With that change and
> > > installing the modified systemd into the container root filesystem, I
> > > can start a container via systemd-nspawn with the --selinux-namespace
> > > option and have it unshare the SELinux namespace, load policy from the
> > > container's root, and set its enforcing mode. At present, if the
> > > container is configured to be enforcing, the container will fail due
> > > to denials in the child SELinux namespace arising from the following:
> > > - systemd creates a regular tmpfs mount for the container /dev, so at
> > > least some of the /dev nodes are not correctly labeled at startup.
> > > This can likely be fixed through some combination of policy and
> > > perhaps performing a restorecon("/dev") after first loading policy.
> > > - Certain /proc/sys files in the container are labeled with
> > > "unlabeled_t" for some reason, likely due to being accessed in the
> > > namespace before it loads a policy and not getting initialized
> > > afterward. Similarly could be fixed via a restorecon("/proc") after
> > > policy load if we can't solve it kernel-side.
> >
> > Sorry, obviously can't do a restorecon of /proc so that's not an option.
> > I suspect that the existing selinux_complete_init() walk of
> > uninitialized superblocks and their inodes after first policy load
> > isn't getting done properly for child SELinux namespaces; will have to
> > look into that on the kernel side.
>
> Yes, that was the issue. Fixed with another commit pushed up to the
> working-selinuxns branch of my selinux-kernel fork. So the /proc
> labeling is fixed within the container. Still have the other denials
> to address but those might all be userspace or policy fixes.

Ok, I confirmed that the remaining denials are due to multiple tmpfs
mounts and a socket created by systemd-nspawn during setup of the
container that are then used by the container at runtime, and I
confirmed that allowing those permissions in the container policy
enables a Fedora container to boot in enforcing mode with its own
SELinux namespace on a Fedora host in enforcing mode. Ultimately we
will want the container runtime (systemd-nspawn in this case) to
properly label those tmpfs mounts and the socket but that's just a
matter of further userspace changes to systemd-nspawn.
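
As a rough illustration of the kind of userspace change meant here, one
way a runtime could label a tmpfs at mount time is the standard SELinux
rootcontext= mount option. The sketch below only builds the mount(2)
data string. The helper name build_tmpfs_opts() and the example context
are hypothetical, this is not what the systemd-nspawn patch does today,
and how such a context is interpreted across parent and child SELinux
namespaces is an open question rather than something established in
this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a tmpfs mount(2) data string that carries an
 * SELinux rootcontext. The value is double-quoted so that a context with
 * MLS categories (which contain commas) is not split into several options.
 * Returns 0 on success, -1 if the arguments are unusable or the result
 * would not fit in buf.
 */
int build_tmpfs_opts(char *buf, size_t len,
                     const char *base_opts, const char *context)
{
    int n;

    assert(buf != NULL && len > 0);
    if (context == NULL || context[0] == '\0')
        return -1;
    /* A double quote inside the context would end the quoted value early. */
    if (strchr(context, '"') != NULL)
        return -1;

    if (base_opts != NULL && base_opts[0] != '\0')
        n = snprintf(buf, len, "%s,rootcontext=\"%s\"", base_opts, context);
    else
        n = snprintf(buf, len, "rootcontext=\"%s\"", context);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string would be passed as the final argument of
mount("tmpfs", target, "tmpfs", flags, opts), which needs CAP_SYS_ADMIN
and a policy that permits the chosen label. The socket would need a
separate mechanism, which this sketch does not cover.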

Still lots to do to allow more interesting combinations but I'll leave
it there for a bit and see if anyone is actually interested in this
besides me...
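
For anyone experimenting with a container runtime, the three selinuxfs
situations mentioned earlier in the thread (already mounted, mounted
read-only, not mounted at all) can be told apart from the mount table.
The sketch below classifies text in /proc/self/mounts format. The names
classify_selinuxfs() and has_mount_opt() are made up for illustration,
and this is not how selinux_unshare() in the libselinux branch is
implemented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum selinuxfs_state { SELINUXFS_ABSENT, SELINUXFS_RW, SELINUXFS_RO };

/* Return 1 if the comma-separated option list contains exactly `opt`. */
int has_mount_opt(const char *opts, const char *opt)
{
    size_t n = strlen(opt);
    const char *p = opts;

    while (*p != '\0') {
        size_t tok = strcspn(p, ",");
        if (tok == n && strncmp(p, opt, n) == 0)
            return 1;
        p += tok;
        if (*p == ',')
            p++;
    }
    return 0;
}

/*
 * Classify selinuxfs from text in /proc/self/mounts format
 * ("device mountpoint fstype options dump pass", one mount per line).
 * Any selinuxfs mount counts, wherever it is mounted; if there are
 * several, the last one listed decides the result.
 */
enum selinuxfs_state classify_selinuxfs(const char *mounts)
{
    enum selinuxfs_state state = SELINUXFS_ABSENT;
    const char *line = mounts;

    assert(mounts != NULL);
    while (*line != '\0') {
        char buf[1024], dev[1024], dir[1024], type[1024], opts[1024];
        size_t len = strcspn(line, "\n");

        /* Parse one line at a time; overlong lines are skipped. */
        if (len < sizeof(buf)) {
            memcpy(buf, line, len);
            buf[len] = '\0';
            if (sscanf(buf, "%1023s %1023s %1023s %1023s",
                       dev, dir, type, opts) == 4 &&
                strcmp(type, "selinuxfs") == 0)
                state = has_mount_opt(opts, "ro") ? SELINUXFS_RO
                                                  : SELINUXFS_RW;
        }
        line += len;
        if (*line == '\n')
            line++;
    }
    return state;
}
```

A caller would read /proc/self/mounts into a buffer and pass it in.
Keeping the parsing separate from the file read makes the three cases
easy to exercise without a real selinuxfs mount.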