Re: SELinux namespaces re-base

Stephen Smalley <stephen.smalley.work@xxxxxxxxx> · Mon, 30 Sep 2024 15:06:23 -0400

On Mon, Sep 30, 2024 at 2:12 PM Topi Miettinen <toiwoton@xxxxxxxxx> wrote:
>
> Hi,
>
> I wonder if SELinux namespaces could be used for sandboxing,
> specifically with systemd. When enabled for a service with a directive
> (something like NamespacedSELinuxPolicy=path), PID1 could load a service
> specific namespaced policy and apply it to the service as it starts.
> These kinds of policies could be extremely minimal and hardened when
> optimized.
>
> The implementation should avoid interfering with other sandboxing
> activities and also avoid AVC pollution from them, so preferably there
> should be a way to set up the namespacing and load the policy in a way
> that these will only take effect at next execve() call, much like
> setexeccon(). However, errors should be returned as early as possible
> so that they can be associated with the loading. Also, it
> should be possible to enable SELinux namespacing independently of other
> namespacing options as they are controlled by other directives.
>
> Would this be an interesting use case? Would it need major design
> changes? Systemd already loads a SELinux policy at boot so there's some
> infrastructure in place.

I don't think there is anything in the current implementation that
would preclude such usage, but I'm not sure that's a major use case
for the SELinux namespace support - sounds more like you want to apply
Landlock or similar sandboxing via systemd configuration.

At present, the unshare operation takes effect immediately rather than
being deferred to the next execve(), just like the other namespace
unshare operations, but deferring it would be easy to do if necessary.
The current sequence as I've sketched in this email thread is to
unshare the SELinux namespace, mount your own private selinuxfs
instance that only affects your policy, load a policy, set enforcing
mode, and switch to an appropriate security context in the child -
either via setcon(3) or execve(). The policy and AVC are private to
your namespace. Permissions are checked against the current namespace
and all ancestors (for the checks that I have converted thus far,
still WIP). The process context in the child is separate/independent
of the context in the parent, but bounded in permissions by it.
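
As a rough illustration of that sequence, here is a minimal Python sketch
that only builds the shell steps in order; it does not run them. The
selinuxfs "unshare" node it writes to is an assumption about the
work-in-progress interface and may not match the final one, and the
function name, context, and command are hypothetical. The steps are
meant to run inside a private mount namespace (e.g. under `unshare -m`)
so the new selinuxfs mount stays private.

```python
import shlex

SELINUXFS = "/sys/fs/selinux"  # conventional selinuxfs mount point


def namespace_setup_script(context, command):
    """Return shell lines for: unshare, mount, load, enforce, switch context.

    `context` is the security context for the child and `command` is its
    argv list; both are illustrative inputs, not part of any real API.
    """
    exec_argv = " ".join(
        shlex.quote(arg) for arg in ["runcon", context, *command]
    )
    return [
        # ASSUMPTION: unsharing is triggered by writing to a selinuxfs node;
        # the actual WIP interface may differ.
        f"echo 1 > {SELINUXFS}/unshare",
        # Mount a private selinuxfs instance for the new namespace.
        f"mount -t selinuxfs none {SELINUXFS}",
        # Load a policy into the new namespace (now private to it).
        "load_policy",
        # Switch the new namespace to enforcing mode.
        "setenforce 1",
        # Enter the target context via execve(), replacing the shell.
        f"exec {exec_argv}",
    ]


if __name__ == "__main__":
    steps = namespace_setup_script(
        "system_u:system_r:example_t:s0",  # hypothetical context
        ["/usr/bin/example"],              # hypothetical service binary
    )
    print("\n".join(steps))
```

The final step uses execve() through runcon; a process that wants to
stay in the same program image would call setcon(3) instead.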