Re: Race condition during policy load in kernel

Daniel Burgener <dburgener@xxxxxxxxxxxxxxxxxxx> · Wed, 29 Apr 2020 08:40:07 -0400

On 4/29/20 8:32 AM, Stephen Smalley wrote:
On Tue, Apr 28, 2020 at 4:38 PM Daniel Burgener
<dburgener@xxxxxxxxxxxxxxxxxxx> wrote:
On 4/28/20 3:34 PM, Ondrej Mosnacek wrote:
On Tue, Apr 28, 2020 at 8:54 PM Stephen Smalley
<stephen.smalley.work@xxxxxxxxx> wrote:
On Mon, Apr 27, 2020 at 4:40 PM Daniel Burgener
<dburgener@xxxxxxxxxxxxxxxxxxx> wrote:
Hello all,

We've noticed a few instances of strange failures in userspace object
managers happening immediately after a policy load, that we believe
we've traced to a race condition in the kernel, and wanted to get your
thoughts on our diagnosis and the appropriate fix before creating a
patch in case we've missed something.

The issue is that userspace object managers rely on /sys/fs/selinux to
determine the mapping of object class and permission strings to numeric
identifiers, however immediately after a policy load,
/sys/fs/selinux/{booleans,class,policy_capabilities}, are deleted and
recreated.  This leaves a window where the object classes (etc) appear
to be not defined, even if they are defined in both the old and new
policies.

We have observed this with both dbus and systemd userspace object
managers, and it can be reproduced straightforwardly by running the
following (borrowed from bug linked below):

# (while true; do cat /sys/fs/selinux/class/service/index >/dev/null;
done) &
# while true; do sudo load_policy; echo -n .; sleep 0.1;done

Periodically, we will get "No such file or directory" messages printed
to stderr.  In the event of a userspace object manager using libselinux
to check a userspace permission, that will result in a USER_AVC message
indicating an unknown object class, and in the event that
--handle-unknown is set to "deny", it will deny the access.

It seems to me as though some sort of locking should occur in the
selinuxfs to prevent access to the files being deleted and recreated
during the policy load, so that userspace programs relying on them (in
particular userspace object managers doing class lookups) get a
consistent view of the classes, perms, booleans and capabilities in the
loaded policy.

This seems to be related to
https://github.com/SELinuxProject/selinux-kernel/issues/42 but I believe
it is a different case.  The deadlock in that bug seems to be related to
the underlying filesystem functions, specifically around directory
deletion while this is an issue the selinuxfs logic specifically. The
above linked issue appears to have been fixed in recent upstream
kernels, per the bug, but I have verified the issue I am discussing here
in 5.7.0 rc3.

It seems to me as though from the perspective of userspace that all of
sel_make_policy_nodes (or at least all of each of its component
functions) should be atomic.  There was some discussion in a previous
thread
(https://lore.kernel.org/selinux/20181002155810.GP32577@xxxxxxxxxxxxxxxxxx/)
around a significant refactor of policy loading in general.  It appears
as though the direct issue there of access during the deletion has been
resolved (commit d4f4de5e5ef8efde85febb6876cd3c8ab1631999), although the
complete suggested fix of refactoring policy_load into two parts has not
been done.  Would that refactor be the right approach to the problem I
am trying to solve?  Would a patch for adding locking around the
selinuxfs delete/recreate operation be considered? That wouldn't address
all the concerns, (namely the potential to access a view of the policy
that doesn't match the currently loaded policy and error recovery in the
case that sel_make_nodes fails), but would improve the reliability of
existing userspace object managers

I'm happy to create and submit a patch, but I wanted to get the
communities thought on the problem and correct approach first.
I think the best approach is the one sketched by viro but I recognize
that's a big lift.
Willing to help but not entirely clear on the interactions with the
dcache for atomically swapping
the new trees into place.

I suspect this is just being exposed now due to more recent changes in
userspace to try to fully support
changes in class/permission values at runtime.  Previously userspace
object managers tended to only
map them during initialization or on first use and then would just
keep using the cached values, whereas
now they try to map them lazily and flush their caches on a reload
notification.  Some of those changes
were in libselinux and others in the userspace object managers (e.g.
systemd, dbus-broker or dbusd).
Not sure exactly what versions of libselinux, systemd, and
dbusd/dbus-broker you are using.

selinuxfs itself does take a mutex around the entire policy load
including the delete/re-create sequence but
that only serializes with other policy operations (load, read,
booleans), not with dcache walk/lookup.

cc'ing Ondrej since he attempted to fix the earlier selinuxfs bug and
may have an opinion on the best way
forward for this issue.
Well, I attempted a few times and each time failed miserably :)
Thankfully, hitting the bug via another fs eventually forced viro to
fix it (in a way I could probably never come up with myself). However,
I remember trying to originally fix the bug by means of making the
swapover atomic, but later I realized that these two are actually
independent issues. After that I didn't get back to atomizing the
swapover, but IIRC I had the impression that it might not be all that
difficult... (at least if I ignore possible failures during the new
dentry tree creation for now - it would still be unsafe, but it would
be a start). But knowing VFS, I bet when I actually try it will prove
to be much more tricky ;)

I'll try to find some time to sit down to it again, but at the moment
I'm juggling a bunch of higher priority stuff so it might be a while
before I get to it.

The weird thing about atomizing the swapover in isolation is that it's
unclear what action to take on failure.  The existing code looks like it
would leave a broken version of the selinuxfs laying around.  Completing
the swapover even on failure would give us an atomic version of the
current situation, but feels extra weird. If ignoring the failure case
as Ondrej suggests would be acceptable in a patch, that sounds like the
quickest way to addressing at least part of the problem (the refactor of
security_load_policy by itself doesn't get us much by itself, since we
can't roll back the selinuxfs after we've started in the current state.)
I don't entirely understand what you mean by the above, since viro's sketch
of a new policy load sequence does permit handling the error case in the middle
of the new selinuxfs directory/file creation by just throwing away the
new entries
altogether and leaving the old policydb/sidtab + old selinuxfs in
place.  That said
I'm not saying you have to do the whole thing at once.

My point on that was just that without building the new selinuxfs 
entries offline, we don't have the old selinuxfs anymore to restore to, 
since it's already been deleted.  I suppose we could rebuild it at that 
point though.  Viro's approach works great once we have everything, but 
just refactoring policy load into two parts as a first step without 
modifying the selinuxfs by itself I don't think gets us very much if 
anything.

I'd be happy to start looking at either half, although I'd prefer to
start with atomizing the tree swapover to solve my immediate problem if
that would have a chance at getting merged by itself given the issues
around the failure case.
Certainly willing to look at incremental improvements here.

I'll start looking into the atomic tree swapover first then.  If that 
goes well, I'll look into moving on to the policy load refactor.

-Daniel