On Sun, Feb 28, 2021 at 8:21 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> On Fri, Feb 26, 2021 at 6:12 AM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote:
> > On Fri, Feb 26, 2021 at 2:07 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote:
> > > > After the switch to RCU, we now have:
> > > > 1. Start live conversion of new entries.
> > > > 2. Convert existing entries.
> > > > 3. RCU-assign the new policy pointer to selinux_state.
> > > > [!!! Now actually both old and new sidtab may be referenced by
> > > > readers, since there is no synchronization barrier previously
> > > > provided by the write lock.]
> > > > 4. Wait for synchronize_rcu() to return.
> > > > 5. Now only the new sidtab is visible to readers, so the old one
> > > > can be destroyed.
> > > >
> > > > So the race can happen between 3. and 5., if one thread already
> > > > sees the new sidtab and adds a new entry there, and a second
> > > > thread still has a reference to the old sidtab and also tries to
> > > > add a new entry, live-converting it into the new sidtab, which it
> > > > doesn't expect to change by itself. Unfortunately I failed to
> > > > realize this when reviewing the patch :/
> > >
> > > It is possible I'm not fully understanding the problem and/or
> > > missing an important detail - it is rather tricky code, and RCU can
> > > be very hard to reason about at times - but I think we may be able
> > > to solve this with some lock fixes inside sidtab_context_to_sid().
> > > Let me try to explain to see if we are on the same page here ...
> > >
> > > The problem is when we have two (or more) threads trying to
> > > add/convert the same context into a sid; the task with new_sidtab
> > > is looking to add a new sidtab entry, while the task with
> > > old_sidtab is looking to convert an entry in old_sidtab into a new
> > > entry in new_sidtab. Boom.
> > >
> > > Looking at the code in sidtab_context_to_sid(), when we have two
> > > sidtabs that are currently active (old_sidtab->convert pointer is
> > > valid) and a task with old_sidtab attempts to add a new entry to
> > > both sidtabs, it first adds it to the old sidtab and then also adds
> > > it to the new sidtab. I believe the problem is that in this case,
> > > while the task grabs the old_sidtab->lock, it never grabs the
> > > new_sidtab->lock, which allows it to race with tasks that already
> > > see only new_sidtab. I think adding code to sidtab_context_to_sid()
> > > which grabs the new_sidtab->lock when adding entries to the
> > > new_sidtab *should* solve the problem.
> > >
> > > Did I miss something important? ;)
> >
> > Sadly, yes :) Consider this scenario (assuming we fix the locking at
> > the sidtab level):
> >
> > If it happens that a new SID (x) is added via the new sidtab and then
> > another one (y) via the old sidtab, then to avoid a clash of SIDs we
> > would need to leave a "hole" in the old sidtab for SID x. And this
> > will cause trouble if the thread that has just added SID y then tries
> > to translate the context string corresponding to SID x (without
> > re-taking the RCU read lock and refreshing the policy pointer). Even
> > if we handle skipping the "holes" in the old sidtab safely, the
> > translation would then end up adding a duplicate SID entry for the
> > context already represented by SID x - which is not a state we want
> > to end up in.
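(To make the clash above more concrete, here is a small stand-alone toy
model that can be compiled and run in userspace. It is *not* the real
sidtab code - the structures, names and the count-based SID allocation
are made up purely for illustration of the indexing problem:

/*
 * toy_sidtab.c - stand-alone toy model of the SID clash discussed above.
 * Nothing here is the real kernel sidtab; names and the count-based SID
 * allocation are simplifications made up for illustration only.
 */
#include <stdio.h>

#define TOY_MAX 8

struct toy_sidtab {
	int count;                  /* next free SID slot */
	const char *ctx[TOY_MAX];   /* SID -> context string */
	struct toy_sidtab *convert; /* live-conversion target, if any */
};

/* Add @ctx to @s; during a live conversion, mirror it into the target. */
static int toy_add(struct toy_sidtab *s, const char *ctx)
{
	int sid = s->count++;

	s->ctx[sid] = ctx;
	if (s->convert) {
		/*
		 * Mirror the entry into the conversion target at the same
		 * slot - safe only as long as nobody else grows the target.
		 */
		s->convert->ctx[sid] = ctx;
		if (s->convert->count <= sid)
			s->convert->count = sid + 1;
	}
	return sid;
}

int main(void)
{
	struct toy_sidtab old_tab = { 0 }, new_tab = { 0 };

	old_tab.convert = &new_tab;   /* conversion in progress */

	/* Thread A already sees the new table and adds "ctx_x" there. */
	int x = toy_add(&new_tab, "ctx_x");

	/* Thread B still sees the old table and adds "ctx_y" through it;
	 * the mirrored entry lands on the slot that SID x already uses. */
	int y = toy_add(&old_tab, "ctx_y");

	printf("SID x = %d maps to \"%s\" in the new table (expected ctx_x)\n",
	       x, new_tab.ctx[x]);
	printf("SID y = %d clashes with SID x: %s\n",
	       y, x == y ? "yes" : "no");
	return 0;
}

Of course this collapses the two threads into one program, but the
bookkeeping problem is the same: nothing stops the slot chosen via the
old table from already being taken in the new one.)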
> Ah, yes, you're right. I was only thinking about the problem of adding
> an entry to the old sidtab, and not the (much more likely) case of an
> entry being added to the new sidtab. Bummer.
>
> Thinking aloud for a moment - what if we simply refused to add new
> sidtab entries if the task's sidtab pointer is "old"? Common sense
> would tell us that this scenario should be very rare at present, and I
> believe the testing mentioned in this thread adds some weight to that
> claim. After all, this only affects tasks which entered into their RCU
> protected session prior to the policy load RCU sync *AND* are
> attempting to add a new entry to the sidtab. That *has* to be a really
> low percentage, especially on a system that has been up and running
> for some time. My gut feeling is this should be safe as well; all of
> the calling code should have the necessary error handling in place, as
> there are plenty of reasons why we could normally fail to add an entry
> to the sidtab, memory allocation failures being the most obvious
> failure point I would suspect. The obvious downside to such an
> approach is that those operations which meet these criteria would fail
> - and we should likely emit an error in this case - but is this
> failure really worse than any other transient kernel failure,

No, I don't like this approach at all. Before the sidtab refactor, it
had been done exactly this way - ENOMEM was returned while the sidtab
was "frozen" (i.e. while the existing entries were being converted).
And this was a real nuisance, because things would fail randomly during
policy reload. And it's not just unimportant explicit userspace actions
that can fail. Any kind of transition can lead to a new SID being
created, so you'd get things like execve(), mkdir(), ... returning
-ENOMEM sometimes. (With a low probability, but still...)

I wouldn't compare it to a memory allocation failure, which normally
starts happening only when the system becomes overloaded. Here the user
would *always* have some probability of getting this error, and they
couldn't do anything about it.

> and is attempting to mitigate this failure worth abandoning the RCU
> approach for the sidtab?

Perhaps it wasn't clear from what I wrote, but I certainly don't want
to abandon it completely - I just want to revert to a safe state until
we figure out how to do the RCU policy reload safely. The solution with
two-way conversion seems doable (rough sketch below), it's just not a
quick and easy fix.

--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
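(For illustration only - a sketch of the two-way idea in terms of the
toy model earlier in this message, not the eventual kernel patch; the
helper name toy_add_two_way() is made up, and the real thing would have
to take both sidtabs' locks in a fixed order and handle allocation
failures. The point is just that, while a conversion is in progress, an
entry added through either table is mirrored into the other one, so the
two tables can never hand out clashing SIDs:

/* Two-way variant of toy_add() from the toy model above (purely
 * illustrative, not the eventual patch): the SID is allocated past the
 * end of *both* tables and the entry is stored in both, so an entry
 * added via either table can never clash with one added via the other. */
static int toy_add_two_way(struct toy_sidtab *a, struct toy_sidtab *b,
			   const char *ctx)
{
	/* The real kernel would do this under both sidtabs' locks, taken
	 * in a fixed order; the toy model is single-threaded, so plain
	 * code is enough to show the bookkeeping. */
	int sid = a->count > b->count ? a->count : b->count;

	a->ctx[sid] = ctx;
	b->ctx[sid] = ctx;
	a->count = b->count = sid + 1;
	return sid;
}

Whether the real fix ends up looking like this is a separate question -
the sketch only shows why mirroring in both directions removes the
clash.)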