We've recently noticed an issue with how systemd handles SELinux
labeling for sockets.
In the common case, systemd checks the label of the binary it expects to
execute, then calls security_compute_create_raw() to determine the label
of the process it will create, and applies that label to the socket
using setsockcreatecon(). This makes sense as it matches the label the
socket would get if the process created it itself.
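
As a rough illustration of the common case, here is a toy Python model (not systemd or libselinux code). The transition table and the httpd types are only illustrative assumptions, and compute_create() merely mimics what security_compute_create_raw() does for the process class: apply a matching type_transition rule, otherwise keep the source type.

```python
# Toy model only: the rule table below is a hypothetical stand-in for policy.
# (source process type, type of the executed file) -> resulting process type
TYPE_TRANSITIONS = {
    ("init_t", "httpd_exec_t"): "httpd_t",
}


def compute_create(source_type, target_type):
    """Mimic security_compute_create_raw() for the process class:
    use a matching type_transition rule, else fall back to the source type."""
    return TYPE_TRANSITIONS.get((source_type, target_type), source_type)


def label_for_socket(init_type, binary_type):
    """Type that would be handed to setsockcreatecon() in the common case."""
    return compute_create(init_type, binary_type)


print(label_for_socket("init_t", "httpd_exec_t"))
```

With a matching transition rule the socket gets the daemon's type; without one it falls back to the type of systemd itself.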
However, when certain systemd directives are set, such as RootImage= or
ExtensionImages=, systemd skips the above behavior and creates
sockets without any label handling, so they inherit the label
of systemd (typically init_t):
https://github.com/systemd/systemd/blob/13a42b776db9f4bd1e827091b6640801c54304e0/src/core/service.c#L5483-L5486
The result is that sockets end up with either the label of the
process or the label of systemd, depending on unrelated systemd directive
changes. Additionally, the init_t label prevents policy authors from
controlling access to these sockets at a fine granularity.
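
The inconsistency can be sketched with a small Python model. This is a simplification under stated assumptions, not the actual systemd logic: the unit is a plain dict, "process_type" is a hypothetical field holding the type the service process would run as, and only the two image directives are modelled.

```python
def socket_type(unit, init_type="init_t"):
    """Toy model of which type a systemd-created socket ends up with."""
    # Assumption: with an image-based root the binary cannot be inspected
    # up front, so no label is computed and the socket inherits systemd's.
    if unit.get("RootImage") or unit.get("ExtensionImages"):
        return init_type
    # Otherwise the socket gets the type of the process that will own it.
    return unit["process_type"]


plain = {"process_type": "httpd_t"}
imaged = {"process_type": "httpd_t", "RootImage": "/images/httpd.raw"}

print(socket_type(plain), socket_type(imaged))
```

The same service gets two different socket types depending only on whether an image directive is present, which is the behavior described above.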
On most upstream policies this ends up working functionally. Fedora
added a "temporary" workaround to allow the init_t access for all init_t
daemons back in 2010 and never removed it:
https://github.com/fedora-selinux/selinux-policy/blob/8dfcddb1f7227bbdf98776f795be53cf50734b04/policy/modules/system/init.te#L604-L605
That workaround accidentally got pulled into refpolicy in a large block
of systemd changes back in 2017:
https://github.com/SELinuxProject/refpolicy/blob/6e54a2eda6f493c585a3fc59e8ddc54f341dbf0c/policy/modules/system/init.te#L1600-L1601
So in practice a lot of upstream policies are allowing access either
way, preventing functional issues.
We've spoken with a few systemd maintainers internally and they have
indicated that there is a fundamental timing issue with the current
approach - there are use cases where the socket must be available prior
to the image that contains the binary, so determining the label of the
binary prior to socket creation is impossible.
Some observations on the possible approaches:
* The current approach of applying the label of the resulting process
seems impossible to do in all cases from a systemd perspective
* Reading the expected binary label from the file_contexts would avoid
the timing issue, but assumes a system where the binary labels generally
match the file_contexts
* Inheriting the init_t label prevents security enforcement across
different systemd-created sockets, and conflates IPC with systemd with
IPC with systemd-spawned processes
* Setting some other static label for all sockets avoids the conflation
between systemd and its children, but not between various children
* Checking the file_contexts for the path of the socket makes a lot of
sense in the case where sockets have paths, but systemd supports
creating sockets without paths (abstract unix sockets, for example)
* Using the SELinuxContext= systemd directive causes systemd to use that
label for the resulting process (and therefore socket), so it skips
checking the binary and socket labeling works. However, this scatters
policy details across unit files, and doesn't permit decoupling unit
files and policy. It is also unintuitive to expect anyone
to know that when they use RootImage= or ExtensionImages= they must also
use SELinuxContext= or their sockets will be mislabeled.
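
To make the file_contexts idea from the list above concrete, here is a minimal Python sketch. The entries are hypothetical, and the matching is deliberately simplified (first full match in a list ordered most specific first); real file_contexts lookup in libselinux has more involved precedence rules.

```python
import re

# Hypothetical file_contexts-style entries, ordered most specific first.
FILE_CONTEXTS = [
    (r"/usr/sbin/httpd", "httpd_exec_t"),
    (r"/usr/s?bin/.*", "bin_t"),
]


def expected_file_type(path):
    """Return the type policy expects for `path`, without touching the file.
    Returns None when no entry matches."""
    for pattern, file_type in FILE_CONTEXTS:
        if re.fullmatch(pattern, path):
            return file_type
    return None


print(expected_file_type("/usr/sbin/httpd"))
```

Because the lookup only needs the path, it would work before the image containing the binary is available, but the answer is only right if the binary's actual label matches the file_contexts.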
We're curious to hear the community's thoughts here. Any ideas or
suggestions for how we might address this situation?
-Daniel