On Wed, Oct 9, 2024 at 1:57 PM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> On Wed, Oct 9, 2024 at 9:09 AM Stephen Smalley
> <stephen.smalley.work@xxxxxxxxx> wrote:
> >
> > On Tue, Oct 8, 2024 at 9:32 AM Stephen Smalley
> > <stephen.smalley.work@xxxxxxxxx> wrote:
> > > Re-based again on top of latest selinux/dev to resolve the conflicts
> > > with the just-merged patches and to update the new netlink xperm
> > > support for SELinux namespaces. Passes the selinux-testsuite including
> > > the (not yet merged) nlmsg tests in both the init SELinux namespace
> > > and a child SELinux namespace (modulo the labeled IPSEC tests and with
> > > the init SELinux namespace permissive for testing the child or
> > > modifying the init namespace policy to permit it to run all the tests
> > > in the child context). Functionally, this is nearly complete as far as
> > > SELinux-only changes go (not including the corresponding work needed
> > > to namespace audit and if desired/necessary, to allow namespacing of
> > > the labeled IPSEC hooks), modulo any bugs that get discovered in
> > > trying to create real containers with their own SELinux namespaces and
> > > different combinations of policies between the host OS and the
> > > containers.
> > >
> > > My remaining ToDo list is as follows, but this is a good point for
> > > others to provide feedback or experiment with the functionality or
> > > propose their favorite container runtime for the next stages of
> > > prototyping. If it would help spark feedback, I could post the current
> > > set of kernel patches to the list.
> > >
> > > - Test creation/use of SELinux namespaces from actual containers with
> > > different policies from the host OS. This may require patching a
> > > container runtime to add support for unsharing the SELinux namespace
> > > and unmounting the old selinuxfs prior to starting the container init.
> > > Combinations to test: No policy loaded on host, policy loaded in
> > > container e.g. Fedora on Ubuntu; host with newer base policy than
> > > container e.g. RHEL/Rocky 8/9 on Fedora; container with newer base
> > > policy than host e.g. Fedora on RHEL/Rocky 8/9; host and container
> > > with different upstream policy sources e.g. Ubuntu on Fedora; Android
> > > container on Linux host OS.
> >
> > To help get this started, I created a patch for libselinux to provide
> > a selinux_unshare() API that unshares the SELinux namespace (hiding
> > the current messy internal details of the existing kernel interface
> > and also dealing with various situations under which it might be
> > called by container runtimes with selinuxfs already mounted, bind
> > mounted read-only, or not mounted at all) along with a sample
> > unshareselinux utility that shows how to use it, and a patch for
> > systemd-nspawn to show how it might be called from a container runtime
> > to unshare the SELinux namespace during container creation. These can
> > be found in the selinuxns branches of my selinux userspace and systemd
> > forks at:
> > https://github.com/stephensmalley/selinux/tree/selinuxns
> > and
> > https://github.com/stephensmalley/systemd/tree/selinuxns
> > respectively.
> >
> > While the patches appear to work correctly (i.e. you end up with a new
> > SELinux namespace, after which you can mount a new selinuxfs that is
> > private to your namespace, load a policy, set enforcing mode, etc),
> > unfortunately it appears that systemd doesn't just do the Right Thing
> > automatically when it is invoked as a container init after unsharing
> > the SELinux namespace, i.e. it does not proceed to call the SELinux
> > setup functionality so it never tries to mount selinuxfs and load a
> > policy within the container. Unsurprising but it does mean that
> > someone will need to modify it to do so in this case while ensuring
> > that this doesn't break existing setups without the SELinux namespace
> > functionality.
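
For anyone wanting to reason about the three selinuxfs situations
mentioned above (already mounted, bind mounted read-only, not mounted at
all), here is a rough sketch of how a runtime might tell them apart from
text in /proc/mounts format. This is not the code in the selinuxns
branch; the enum and function names are made up for illustration, and
reporting only the last selinuxfs entry is a simplifying assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; not part of libselinux. */
enum selinuxfs_state {
    SELINUXFS_ABSENT,    /* no selinuxfs entry found */
    SELINUXFS_READONLY,  /* last selinuxfs entry carries the "ro" option */
    SELINUXFS_READWRITE  /* last selinuxfs entry is writable */
};

/* Return 1 if the comma-separated option list contains the token "ro". */
static int has_ro_option(const char *opts)
{
    const char *p = opts;

    while (*p) {
        size_t len = strcspn(p, ",");
        if (len == 2 && strncmp(p, "ro", 2) == 0)
            return 1;
        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}

/*
 * Scan lines of the form "device mountpoint fstype options dump pass".
 * Assumption: the last selinuxfs entry is the one that matters.
 */
static enum selinuxfs_state classify_selinuxfs(FILE *mounts)
{
    char line[1024], dev[256], dir[256], type[64], opts[512];
    enum selinuxfs_state state = SELINUXFS_ABSENT;

    assert(mounts != NULL);
    while (fgets(line, sizeof(line), mounts)) {
        /* Skip lines that do not have the four leading fields. */
        if (sscanf(line, "%255s %255s %63s %511s", dev, dir, type, opts) != 4)
            continue;
        if (strcmp(type, "selinuxfs") != 0)
            continue;
        state = has_ro_option(opts) ? SELINUXFS_READONLY
                                    : SELINUXFS_READWRITE;
    }
    return state;
}
```

A caller would typically pass the result of fopen("/proc/mounts", "r")
and then decide whether to unmount, remount, or mount selinuxfs fresh.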
>
> Pushed up a further commit to the branch on my fork of systemd to have
> it call the SELinux setup + init functions if invoked from
> systemd-nspawn with the SELinux namespace unshared. The existing
> systemd was skipping setup/init of all of the MAC modules if running
> in a container, which was understandable absent namespace support. My
> current patch (just to allow further progress) only relaxes that
> constraint for SELinux and only if launched via systemd-nspawn with
> the --selinux-namespace option; this would of course be generalized
> further if/when we get around to upstreaming it. With that change and
> installing the modified systemd into the container root filesystem, I
> can start a container via systemd-nspawn with the --selinux-namespace
> option and have it unshare the SELinux namespace, load policy from the
> container's root, and set its enforcing mode. At present, if the
> container is configured to be enforcing, the container will fail due
> to denials in the child SELinux namespace arising from the following:
> - systemd creates a regular tmpfs mount for the container /dev, so at
> least some of the /dev nodes are not correctly labeled at startup.
> This can likely be fixed through some combination of policy and
> perhaps performing a restorecon("/dev") after first loading policy.
> - Certain /proc/sys files in the container are labeled with
> "unlabeled_t" for some reason, likely due to being accessed in the
> namespace before it loads a policy and not getting initialized
> afterward. Similarly could be fixed via a restorecon("/proc") after
> policy load if we can't solve it kernel-side.

Sorry, obviously can't do a restorecon of /proc so that's not an
option. I suspect that the existing selinux_complete_init() walk of
uninitialized superblocks and their inodes after first policy load
isn't getting done properly for child SELinux namespaces; will have to
look into that on the kernel side.

> - sendto permission denied from kernel_t and from init_t to
> unconfined_t:unix_dgram_socket; this is likely the container sending
> to a socket in the parent namespace.
>
> There are no doubt more beyond these. However, in permissive (with the
> parent/init namespace still enforcing), the container did come up
> fully and sees SELinux as enabled.
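
In case it helps anyone trying to reproduce the /proc/sys labeling
problem, below is a small sketch for spotting files that ended up with
the unlabeled type. It only parses a context string of the form
user:role:type[:range]; the helper names are hypothetical, and fetching
the context itself (e.g. via libselinux getfilecon() or the
security.selinux xattr) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the type field (third colon-separated field) of a security
 * context into buf. Returns 0 on success, -1 if the context has fewer
 * than three fields or buf is too small. Hypothetical helper.
 */
static int context_type(const char *ctx, char *buf, size_t buflen)
{
    const char *p = ctx;
    size_t len;
    int i;

    assert(ctx != NULL && buf != NULL);
    /* Skip past the user and role fields. */
    for (i = 0; i < 2; i++) {
        p = strchr(p, ':');
        if (p == NULL)
            return -1;
        p++;
    }
    /* The type ends at the next colon (start of the range) or at NUL. */
    len = strcspn(p, ":");
    if (len == 0 || len >= buflen)
        return -1;
    memcpy(buf, p, len);
    buf[len] = '\0';
    return 0;
}

/* Return 1 if the context's type is "unlabeled_t", else 0. */
static int is_unlabeled(const char *ctx)
{
    char type[128];

    return context_type(ctx, type, sizeof(type)) == 0 &&
           strcmp(type, "unlabeled_t") == 0;
}
```

Running is_unlabeled() over the contexts of the /proc/sys files inside
the container before and after policy load would show which inodes were
never initialized in the child namespace.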