Re: SELinux namespaces re-base

Stephen Smalley <stephen.smalley.work@xxxxxxxxx> · Tue, 8 Oct 2024 09:32:44 -0400

On Mon, Oct 7, 2024 at 4:01 PM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
> Re-based again on latest selinux/dev, and pushed a few more changes:
> - selinux: make open_perms namespace-aware; demonstrates how to
> integrate the namespace-based checking with the open_perms policy
> capability so that we only check file open permission in namespaces
> that enable the capability in their policy. That was the easy case;
> the rest of the policy capabilities are less clear on how to resolve
> as per my earlier description.
> - selinux: split cred_ssid_has_perm() into two cases; alters how the
> namespace-based checking is applied to socket and SysV IPC checks
> based on some testing
> - selinux: set initial SID context for init to "kernel" in global SID
> table; fixes what would be a userspace compatibility problem for init
> in child namespaces by duplicating what we were already doing in the
> security server's per-policy SID table.
>
> In addition to running the testsuite in a child SELinux namespace as
> before, I also launched a RHEL9 UBI container, manually unshared the
> SELinux namespace in a shell within it, and loaded the RHEL9 policy
> into the child namespace to simulate what we would ultimately want the
> container runtime to do. That motivated the latter two changes above.
> Not quite ready to put that into enforcing mode due to some denials
> between container processes and host resources (like the inherited
> open fds) but a step forward.
> I think to better understand what else is needed, it would help to
> prototype modifications to a container runtime to unshare the SELinux
> namespace (just using the current /sys/fs/selinux/unshare interface
> for now) before launching the container init, and see if the container
> init does the right thing (i.e. concludes that SELinux doesn't yet
> have a policy loaded based on /proc/self/attr/current=="kernel",
> mounts its own /sys/fs/selinux private to its namespace, and loads its
> policy into it). Thoughts on what the easiest container runtime to
> patch in this way would be, or anyone want to try?
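
For anyone who wants to experiment with the container init side, the
check described in the quoted text (treating a /proc/self/attr/current
value of "kernel" as "no policy loaded yet in this namespace") could be
sketched roughly as below. This is only an illustration, not what any
existing init does: the function names are made up, and stripping a
trailing NUL or newline is an assumption about how the value is
returned.

```python
def current_context(attr_path="/proc/self/attr/current"):
    """Return the caller's SELinux context as text.

    A trailing NUL or newline is stripped (assumption: the kernel may
    include one). The path is a parameter only so the logic can be
    exercised against an ordinary file.
    """
    with open(attr_path, "rb") as f:
        raw = f.read()
    return raw.rstrip(b"\x00\n").decode("ascii", errors="replace")


def needs_policy_load(attr_path="/proc/self/attr/current"):
    """True when the context is the bare "kernel" string.

    Hypothetical helper: per the description above, that value is what
    a process sees in a fresh namespace before any policy is loaded.
    """
    return current_context(attr_path) == "kernel"
```

An init that gets True here would go on to mount its own selinuxfs and
load its policy; one that gets False would leave things alone.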

Re-based again on top of latest selinux/dev to resolve the conflicts
with the just-merged patches and to update the new netlink xperm
support for SELinux namespaces. This passes the selinux-testsuite,
including the (not yet merged) nlmsg tests, in both the init SELinux
namespace and a child SELinux namespace. The exceptions are the labeled
IPSEC tests, and testing the child requires either making the init
SELinux namespace permissive or modifying the init namespace policy to
permit running all the tests in the child context. Functionally, this
is nearly complete as far as SELinux-only changes go. That excludes the
corresponding work needed to namespace audit and, if desired/necessary,
to allow namespacing of the labeled IPSEC hooks, and it is subject to
any bugs that get discovered in trying to create real containers with
their own SELinux namespaces and different combinations of policies
between the host OS and the containers.

My remaining ToDo list is as follows, but this is a good point for
others to provide feedback or experiment with the functionality or
propose their favorite container runtime for the next stages of
prototyping. If it would help spark feedback, I could post the current
set of kernel patches to the list.

- Test creation/use of SELinux namespaces from actual containers with
different policies from the host OS. This may require patching a
container runtime to add support for unsharing the SELinux namespace
and unmounting the old selinuxfs prior to starting the container init.
Combinations to test: No policy loaded on host, policy loaded in
container e.g. Fedora on Ubuntu; host with newer base policy than
container e.g. RHEL/Rocky 8/9 on Fedora; container with newer base
policy than host e.g. Fedora on RHEL/Rocky 8/9; host and container
with different upstream policy sources e.g. Ubuntu on Fedora; Android
container on Linux host OS.

- Rework how policy capabilities are being checked/used to correctly
support child namespaces with different policy capabilities from the
parent. This has already been done for the open_perms capability by
lifting the logic to walk the namespaces up into the hook function
itself and checking the policy capability value in each namespace, but
many (most?) of the policy capabilities don't lend themselves to this
approach. For example, extended_socket_class enables finer-grained
socket security classes, but this is checked and applied when the
socket security blob is initialized, not at permission check time.
Unless we want to move the mapping logic to every permission check, I
am not sure what can be done there. One option would be to
force-enable the same policy capabilities in the child namespace as in
the parent to avoid conflicts but this would limit the ability to use
differing policies. Similarly, a number of policy capabilities control
labeling behaviors rather than permission checks, and since we are no
longer trying to support per-namespace object SIDs/contexts, only one
namespace's policy can be applied; that label will then be used for all
subsequent checks, even in the other namespaces.

- Decide if any further hardening of selinuxfs is required to safely
permit usage by potentially untrusted / less trusted processes in
child namespaces. There has already been a lot of work to harden e.g.
the policy loading logic against ill-formed policies and such, so not
sure if there is much to do here, but noting it as a possible area to
audit for safety.

- Ensure that we are correctly handling peer and packet labels when
they cross SELinux namespaces, for some definition thereof, both wrt
permission checking and wrt the peer/packet labels that are exposed to
userspace via getsockopt(SO_PEERSEC), recvmsg() SCM_SECURITY, etc.

- Optimize the implementation for the single SELinux namespace case,
reducing and/or eliminating the overhead introduced by the SELinux
namespace support for that common case. Lots of work to do here, help
welcome. Also would appreciate guidance on current Linux kernel
benchmarking best practices since it has been a while since I've had
to do that.

- Re-test with KASAN and with KCSAN enabled to confirm that the
namespace patches haven't introduced any memory errors or race
conditions; I have tested with each of these in the past successfully
but don't keep them enabled generally because they make everything
very slow. And you can't have them both enabled together at runtime
AFAICT.

- Revisit the userspace API for unsharing the SELinux namespace
if/when the rest is ready. Currently just "echo 1 >
/sys/fs/selinux/unshare" (followed by the other necessary steps for
unsharing the mount namespace, unmounting the parent's selinuxfs,
mounting a new selinuxfs for the child, loading a policy, and setting
enforcing mode). Options would include adding a CLONE_SECURITY flag to
unshare/clone that could be implemented by any/all LSMs via a call to
a new (stacked) LSM hook function, or one or more new LSM system calls
to do the same, or just keeping it the way it is via selinuxfs.

- Upstream the kernel support.

- Figure out how to combine the use of SELinux namespaces with Red
Hat's current model of isolating and confining containers as a whole
via SELinux on the host OS. This is complicated by the fact that we
are only supporting a single inode SID/context per inode (gave up on
per-namespace inode SID/contexts, see the earlier mailing list
discussions), and Red Hat's current model uses context mounts to
assign a single security context to all the inodes used by the
container. Possibly introduce a new kind of context= mount that is
namespace-aware, i.e. only apply the context mount when in the outer
namespace but use the inode xattrs inside the child namespace.

- Integrate and upstream userspace support into appropriate container
runtimes, e.g. systemd-nspawn, crun/runc, podman, docker, k8s, etc.
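
As a starting point for the runtime prototyping, the selinuxfs-based
sequence from the userspace API item above (write to the unshare node,
unshare the mount namespace, unmount the parent's selinuxfs, mount a
new one, load a policy, set enforcing) might look roughly like the
sketch below. It is untested against the namespace patches and makes
several assumptions: the unshare node only exists with the prototype
kernel, policy_path is a hypothetical path to a binary policy, marking
"/" recursively private before the unmount is my addition to keep the
unmount from propagating to the host, and the step order simply
follows the list in that item. It needs root and should only be tried
in a throwaway VM.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000               # from <sched.h>
MS_REC, MS_PRIVATE = 0x4000, 1 << 18   # from <sys/mount.h>


def write_node(selinuxfs, name, data):
    """Write data to an existing selinuxfs node in one write(2) call."""
    fd = os.open(os.path.join(selinuxfs, name), os.O_WRONLY)
    try:
        if os.write(fd, data) != len(data):
            raise OSError(f"short write to {name}")
    finally:
        os.close(fd)


def _check(ret, what):
    """Raise OSError with errno if a libc call returned non-zero."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def enter_child_namespace(policy_path, selinuxfs="/sys/fs/selinux"):
    """Run the whole sequence in the calling process (sketch only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong,
                           ctypes.c_void_p]
    libc.umount2.argtypes = [ctypes.c_char_p, ctypes.c_int]
    target = selinuxfs.encode()

    # 1. Unshare the SELinux namespace (prototype interface).
    write_node(selinuxfs, "unshare", b"1")
    # 2. Unshare the mount namespace and stop mount propagation
    #    (the propagation step is an assumption, see above).
    _check(libc.unshare(CLONE_NEWNS), "unshare(CLONE_NEWNS)")
    _check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None),
           "make / rprivate")
    # 3. Replace the parent's selinuxfs with one for the child.
    _check(libc.umount2(target, 0), "umount selinuxfs")
    _check(libc.mount(b"selinuxfs", target, b"selinuxfs", 0, None),
           "mount selinuxfs")
    # 4. Load the policy and switch to enforcing.
    with open(policy_path, "rb") as f:
        write_node(selinuxfs, "load", f.read())
    write_node(selinuxfs, "enforce", b"1")
```

A runtime would call something like enter_child_namespace() just
before exec'ing the container init, or leave the mount and policy load
steps to the init itself.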