On Mon, Sep 30, 2024 at 2:12 PM Topi Miettinen <toiwoton@xxxxxxxxx> wrote:
>
> Hi,
>
> I wonder if SELinux namespaces could be used for sandboxing,
> specifically with systemd. When enabled for a service with a directive
> (something like NamespacedSELinuxPolicy=path), PID1 could load a service
> specific namespaced policy and apply it to the service as it starts.
> These kind of policies could be extremely minimal and hardened when
> optimized.
>
> The implementation should avoid interfering with other sandboxing
> activities and also avoid AVC pollution from them, so preferably there
> should be a way to set up the namespacing and load the policy in a way
> that these will only take effect at next execve() call, much like
> setexeccon(). However, errors should be returned as early as possible
> though so that the error can be associated with the loading. Also it
> should be possible to enable SELinux namespacing independently to other
> namespacing options as they are controlled by other directives.
>
> Would this be an interesting use case? Would it need major design
> changes? Systemd already loads a SELinux policy at boot so there's some
> infrastructure in place.

I don't think there is anything in the current implementation that would
preclude such usage, but I'm not sure that's a major use case for the
SELinux namespace support - sounds more like you want to apply Landlock
or similar sandboxing via systemd configuration.

At present, the unshare operation is not deferred to the next execve(),
no different than any of the other namespace unshare operations, but
that's easy to do if it is necessary for some reason.

The current sequence as I've sketched in this email thread is to unshare
the SELinux namespace, mount your own private selinuxfs instance that
only affects your policy, load a policy, set enforcing mode, and switch
to an appropriate security context in the child - either via setcon(3)
or execve(). The policy and AVC are private to your namespace.
Permissions are checked against the current namespace and all ancestors (for the checks that I have converted thus far, still WIP). The process context in the child is separate/independent of the context in the parent, but bounded in permissions by it.
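
To make the semantics above concrete, here is a rough toy model in Python. It is not kernel code and not the actual implementation: the class `SELinuxNamespace`, the function `check`, the type names, and the simplified rule format are all made up for illustration. It assumes a permission is granted only if the current namespace and every ancestor grant it, that each namespace keeps its own policy and AVC, and that the task carries a separate context per namespace.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical, simplified rule: (source type, target type, class, permission).
Rule = tuple[str, str, str, str]


@dataclass
class SELinuxNamespace:
    """Toy stand-in for one SELinux namespace (illustrative only)."""
    name: str
    parent: SELinuxNamespace | None = None
    policy: set[Rule] = field(default_factory=set)
    enforcing: bool = False
    avc: dict[Rule, bool] = field(default_factory=dict)  # private cache

    def unshare(self, name: str) -> SELinuxNamespace:
        # The child starts with no policy and in permissive mode.
        return SELinuxNamespace(name=name, parent=self)

    def load_policy(self, rules: set[Rule]) -> None:
        self.policy = set(rules)
        self.avc.clear()  # cached decisions are stale after a reload

    def allows(self, rule: Rule) -> bool:
        if rule not in self.avc:
            self.avc[rule] = rule in self.policy
        # Simplification: permissive mode never denies.
        return self.avc[rule] or not self.enforcing


def check(ns: SELinuxNamespace, contexts: dict[str, str],
          target: str, tclass: str, perm: str) -> bool:
    """Grant only if ns and all of its ancestors grant the permission.

    `contexts` maps namespace name to the task's type in that namespace.
    """
    current: SELinuxNamespace | None = ns
    while current is not None:
        rule = (contexts[current.name], target, tclass, perm)
        if not current.allows(rule):
            return False
        current = current.parent
    return True


# Parent (initial) namespace with its own policy.
init_ns = SELinuxNamespace("init", enforcing=True)
init_ns.load_policy({
    ("service_t", "etc_t", "file", "read"),
    ("service_t", "tmp_t", "file", "write"),
})

# Child: unshare, load a private policy, set enforcing mode.
child_ns = init_ns.unshare("svc")
child_ns.load_policy({
    ("sandbox_t", "etc_t", "file", "read"),
    ("sandbox_t", "shadow_t", "file", "read"),
})
child_ns.enforcing = True

# The task has an independent context in each namespace.
task_contexts = {"init": "service_t", "svc": "sandbox_t"}

etc_read = check(child_ns, task_contexts, "etc_t", "file", "read")
shadow_read = check(child_ns, task_contexts, "shadow_t", "file", "read")
tmp_write = check(child_ns, task_contexts, "tmp_t", "file", "write")
```

In this model `etc_read` is granted because both namespaces allow it, `shadow_read` is denied because the parent does not allow it even though the child policy does, and `tmp_write` is denied by the child before the parent is ever consulted, so the child can only narrow what the parent permits.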