On Thu, Oct 10, 2024 at 10:30 AM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
>
> On Wed, Oct 9, 2024 at 3:25 PM Stephen Smalley
> <stephen.smalley.work@xxxxxxxxx> wrote:
> >
> > On Wed, Oct 9, 2024 at 1:57 PM Stephen Smalley
> > <stephen.smalley.work@xxxxxxxxx> wrote:
> > >
> > > On Wed, Oct 9, 2024 at 9:09 AM Stephen Smalley
> > > <stephen.smalley.work@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Oct 8, 2024 at 9:32 AM Stephen Smalley
> > > > <stephen.smalley.work@xxxxxxxxx> wrote:
> > > > > Re-based again on top of latest selinux/dev to resolve the conflicts
> > > > > with the just-merged patches and to update the new netlink xperm
> > > > > support for SELinux namespaces. Passes the selinux-testsuite including
> > > > > the (not yet merged) nlmsg tests in both the init SELinux namespace
> > > > > and a child SELinux namespace (modulo the labeled IPSEC tests and with
> > > > > the init SELinux namespace permissive for testing the child or
> > > > > modifying the init namespace policy to permit it to run all the tests
> > > > > in the child context). Functionally, this is nearly complete as far as
> > > > > SELinux-only changes go (not including the corresponding work needed
> > > > > to namespace audit and if desired/necessary, to allow namespacing of
> > > > > the labeled IPSEC hooks), modulo any bugs that get discovered in
> > > > > trying to create real containers with their own SELinux namespaces and
> > > > > different combinations of policies between the host OS and the
> > > > > containers.
> > > > >
> > > > > My remaining ToDo list is as follows, but this is a good point for
> > > > > others to provide feedback or experiment with the functionality or
> > > > > propose their favorite container runtime for the next stages of
> > > > > prototyping. If it would help spark feedback, I could post the current
> > > > > set of kernel patches to the list.
> > > > >
> > > > > - Test creation/use of SELinux namespaces from actual containers with
> > > > > different policies from the host OS. This may require patching a
> > > > > container runtime to add support for unsharing the SELinux namespace
> > > > > and unmounting the old selinuxfs prior to starting the container init.
> > > > > Combinations to test: No policy loaded on host, policy loaded in
> > > > > container e.g. Fedora on Ubuntu; host with newer base policy than
> > > > > container e.g. RHEL/Rocky 8/9 on Fedora; container with newer base
> > > > > policy than host e.g. Fedora on RHEL/Rocky 8/9; host and container
> > > > > with different upstream policy sources e.g. Ubuntu on Fedora; Android
> > > > > container on Linux host OS.
> > > >
> > > > To help get this started, I created a patch for libselinux to provide
> > > > a selinux_unshare() API that unshares the SELinux namespace (hiding
> > > > the current messy internal details of the existing kernel interface
> > > > and also dealing with various situations under which it might be
> > > > called by container runtimes with selinuxfs already mounted, bind
> > > > mounted read-only, or not mounted at all) along with a sample
> > > > unshareselinux utility that shows how to use it, and a patch for
> > > > systemd-nspawn to show how it might be called from a container runtime
> > > > to unshare the SELinux namespace during container creation. These can
> > > > be found in the selinuxns branches of my selinux userspace and systemd
> > > > forks at:
> > > > https://github.com/stephensmalley/selinux/tree/selinuxns
> > > > and
> > > > https://github.com/stephensmalley/systemd/tree/selinuxns
> > > > respectively.
> > > >
> > > > While the patches appear to work correctly (i.e.
> > > > you end up with a new
> > > > SELinux namespace, after which you can mount a new selinuxfs that is
> > > > private to your namespace, load a policy, set enforcing mode, etc),
> > > > unfortunately it appears that systemd doesn't just do the Right Thing
> > > > automatically when it is invoked as a container init after unsharing
> > > > the SELinux namespace, i.e. it does not proceed to call the SELinux
> > > > setup functionality so it never tries to mount selinuxfs and load a
> > > > policy within the container. Unsurprising but it does mean that
> > > > someone will need to modify it to do so in this case while ensuring
> > > > that this doesn't break existing setups without the SELinux namespace
> > > > functionality.
> > >
> > > Pushed up a further commit to the branch on my fork of systemd to have
> > > it call the SELinux setup + init functions if invoked from
> > > systemd-nspawn with the SELinux namespace unshared. The existing
> > > systemd was skipping setup/init of all of the MAC modules if running
> > > in a container, which was understandable absent namespace support. My
> > > current patch (just to allow further progress) only relaxes that
> > > constraint for SELinux and only if launched via systemd-nspawn with
> > > the --selinux-namespace option; this would of course be generalized
> > > further if/when we get around to upstreaming it. With that change and
> > > installing the modified systemd into the container root filesystem, I
> > > can start a container via systemd-nspawn with the --selinux-namespace
> > > option and have it unshare the SELinux namespace, load policy from the
> > > container's root, and set its enforcing mode. At present, if the
> > > container is configured to be enforcing, the container will fail due
> > > to denials in the child SELinux namespace arising from the following:
> > > - systemd creates a regular tmpfs mount for the container /dev, so at
> > > least some of the /dev nodes are not correctly labeled at startup.
> > > This can likely be fixed through some combination of policy and
> > > perhaps performing a restorecon("/dev") after first loading policy.
> > > - Certain /proc/sys files in the container are labeled with
> > > "unlabeled_t" for some reason, likely due to being accessed in the
> > > namespace before it loads a policy and not getting initialized
> > > afterward. Similarly could be fixed via a restorecon("/proc") after
> > > policy load if we can't solve it kernel-side.
> >
> > Sorry, obviously can't do a restorecon of /proc so that's not an option.
> > I suspect that the existing selinux_complete_init() walk of
> > uninitialized superblocks and their inodes after first policy load
> > isn't getting done properly for child SELinux namespaces; will have to
> > look into that on the kernel side.
>
> Yes, that was the issue. Fixed with another commit pushed up to the
> working-selinuxns branch of my selinux-kernel fork. So the /proc
> labeling is fixed within the container. Still have the other denials
> to address but those might all be userspace or policy fixes.

Ok, I confirmed that the remaining denials are due to multiple tmpfs
mounts and a socket created by systemd-nspawn during setup of the
container that are then used by the container at runtime, and I
confirmed that allowing those permissions in the container policy
enables a Fedora container to boot in enforcing mode with its own
SELinux namespace on a Fedora host in enforcing mode. Ultimately we will
want the container runtime (systemd-nspawn in this case) to properly
label those tmpfs mounts and the socket but that's just a matter of
further userspace changes to systemd-nspawn. Still lots to do to allow
more interesting combinations but I'll leave it there for a bit and see
if anyone is actually interested in this besides me...
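
For anyone wanting to reproduce the Fedora-on-Fedora boot described above, the launch step might look roughly like the sketch below. It only builds the command line and does not run anything. Assumptions: the patched systemd from the selinuxns branch is installed on the host and in the container root, the --selinux-namespace option takes no argument (it does not exist in upstream systemd), and the container root path and the helper name nspawn_argv are hypothetical.

```python
import shlex
from typing import List


def nspawn_argv(container_root: str, selinux_namespace: bool = True) -> List[str]:
    """Build (but do not execute) a systemd-nspawn command line."""
    argv = ["systemd-nspawn"]
    if selinux_namespace:
        # Assumption: this option is provided only by the patched systemd
        # on the selinuxns branch; upstream systemd-nspawn will reject it.
        argv.append("--selinux-namespace")
    # --directory and --boot are standard systemd-nspawn options:
    # use this tree as the container root and run its init.
    argv.append(f"--directory={container_root}")
    argv.append("--boot")
    return argv


if __name__ == "__main__":
    # Hypothetical container root; substitute a real Fedora tree that has
    # the modified systemd installed.
    print(shlex.join(nspawn_argv("/var/lib/machines/fedora")))
```

Keeping the namespace flag behind a parameter makes it easy to boot the same tree with and without a separate SELinux namespace when comparing denials.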