On Sun, Feb 28, 2021 at 8:21 PM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> On Fri, Feb 26, 2021 at 6:12 AM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote:
> > On Fri, Feb 26, 2021 at 2:07 AM Paul Moore <paul@xxxxxxxxxxxxxx> wrote:
> > > On Wed, Feb 24, 2021 at 4:35 AM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote:
> > > > After the switch to RCU, we now have:
> > > > 1. Start live conversion of new entries.
> > > > 2. Convert existing entries.
> > > > 3. RCU-assign the new policy pointer to selinux_state.
> > > > [!!! Now actually both old and new sidtab may be referenced by
> > > > readers, since there is no synchronization barrier previously
> > > > provided by the write lock.]
> > > > 4. Wait for synchronize_rcu() to return.
> > > > 5. Now only the new sidtab is visible to readers, so the old one
> > > > can be destroyed.
> > > >
> > > > So the race can happen between 3. and 5., if one thread already
> > > > sees the new sidtab and adds a new entry there, and a second
> > > > thread still has a reference to the old sidtab and also tries to
> > > > add a new entry, live-converting it into the new sidtab, which it
> > > > doesn't expect to change by itself. Unfortunately I failed to
> > > > realize this when reviewing the patch :/
> > >
> > > It is possible I'm not fully understanding the problem and/or
> > > missing an important detail - it is rather tricky code, and RCU can
> > > be very hard to reason about at times - but I think we may be able
> > > to solve this with some lock fixes inside sidtab_context_to_sid().
> > > Let me try to explain to see if we are on the same page here ...
> > >
> > > The problem is when we have two (or more) threads trying to
> > > add/convert the same context into a sid; the task with new_sidtab
> > > is looking to add a new sidtab entry, while the task with
> > > old_sidtab is looking to convert an entry in old_sidtab into a new
> > > entry in new_sidtab. Boom.
> > >
> > > Looking at the code in sidtab_context_to_sid(), when we have two
> > > sidtabs that are currently active (old_sidtab->convert pointer is
> > > valid) and a task with old_sidtab attempts to add a new entry to
> > > both sidtabs, it first adds it to the old sidtab and then also adds
> > > it to the new sidtab. I believe the problem is that in this case,
> > > while the task grabs the old_sidtab->lock, it never grabs the
> > > new_sidtab->lock, which allows it to race with tasks that already
> > > see only new_sidtab. I think adding code to sidtab_context_to_sid()
> > > which grabs the new_sidtab->lock when adding entries to the
> > > new_sidtab *should* solve the problem.
> > >
> > > Did I miss something important? ;)
> >
> > Sadly, yes :) Consider this scenario (assuming we fix the locking at
> > the sidtab level):
> >
> > If it happens that a new SID (x) is added via the new sidtab and then
> > another one (y) via the old sidtab, then to avoid a clash of SIDs we
> > would need to leave a "hole" in the old sidtab for SID x. And this
> > will cause trouble if the thread that has just added SID y then tries
> > to translate the context string corresponding to SID x (without
> > re-taking the RCU read lock and refreshing the policy pointer). Even
> > if we handle skipping the "holes" in the old sidtab safely, the
> > translation would then end up adding a duplicate SID entry for the
> > context already represented by SID x - which is not a state we want
> > to end up in.
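(To make the clash above more concrete, here is a small stand-alone toy
model that can be compiled and run in userspace. It is *not* the real
sidtab code - the structures, names and the count-based SID allocation
are made up purely for illustration of the indexing problem:

/*
 * toy_sidtab.c - stand-alone toy model of the SID clash discussed above.
 * Nothing here is the real kernel sidtab; names and the count-based SID
 * allocation are simplifications made up for illustration only.
 */
#include <stdio.h>

#define TOY_MAX 8

struct toy_sidtab {
	int count;                  /* next free SID slot */
	const char *ctx[TOY_MAX];   /* SID -> context string */
	struct toy_sidtab *convert; /* live-conversion target, if any */
};

/* Add @ctx to @s; during a live conversion, mirror it into the target. */
static int toy_add(struct toy_sidtab *s, const char *ctx)
{
	int sid = s->count++;

	s->ctx[sid] = ctx;
	if (s->convert) {
		/*
		 * Mirror the entry into the conversion target at the same
		 * slot - safe only as long as nobody else grows the target.
		 */
		s->convert->ctx[sid] = ctx;
		if (s->convert->count <= sid)
			s->convert->count = sid + 1;
	}
	return sid;
}

int main(void)
{
	struct toy_sidtab old_tab = { 0 }, new_tab = { 0 };

	old_tab.convert = &new_tab;   /* conversion in progress */

	/* Thread A already sees the new table and adds "ctx_x" there. */
	int x = toy_add(&new_tab, "ctx_x");

	/* Thread B still sees the old table and adds "ctx_y" through it;
	 * the mirrored entry lands on the slot that SID x already uses. */
	int y = toy_add(&old_tab, "ctx_y");

	printf("SID x = %d maps to \"%s\" in the new table (expected ctx_x)\n",
	       x, new_tab.ctx[x]);
	printf("SID y = %d clashes with SID x: %s\n",
	       y, x == y ? "yes" : "no");
	return 0;
}

Of course this collapses the two threads into one program, but the
bookkeeping problem is the same: nothing stops the slot chosen via the
old table from already being taken in the new one.)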
> Ah, yes, you're right. I was only thinking about the problem of adding
> an entry to the old sidtab, and not the (much more likely) case of an
> entry being added to the new sidtab. Bummer.
>
> Thinking aloud for a moment - what if we simply refused to add new
> sidtab entries if the task's sidtab pointer is "old"? Common sense
> would tell us that this scenario should be very rare at present, and I
> believe the testing mentioned in this thread adds some weight to that
> claim. After all, this only affects tasks which entered into their RCU
> protected session prior to the policy load RCU sync *AND* are
> attempting to add a new entry to the sidtab. That *has* to be a really
> low percentage, especially on a system that has been up and running
> for some time. My gut feeling is this should be safe as well; all of
> the calling code should have the necessary error handling in place, as
> there are plenty of reasons why we could normally fail to add an entry
> to the sidtab, memory allocation failures being the most obvious
> failure point I would suspect. The obvious downside to such an
> approach is that those operations which meet these criteria would fail
> - and we should likely emit an error in this case - but is this
> failure really worse than any other transient kernel failure,

No, I don't like this approach at all. Before the sidtab refactor, it
had been done exactly this way - ENOMEM was returned while the sidtab
was "frozen" (i.e. while the existing entries were being converted).
And this was a real nuisance, because things would fail randomly during
policy reload. And it's not just unimportant explicit userspace actions
that can fail. Any kind of transition can lead to a new SID being
created, so you'd get things like execve(), mkdir(), ... returning
-ENOMEM sometimes. (With a low probability, but still...)

I wouldn't compare it to a memory allocation failure, which normally
starts happening only when the system becomes overloaded. Here the user
would *always* have some probability of getting this error, and they
couldn't do anything about it.

> and is attempting to mitigate this failure worth abandoning the RCU
> approach for the sidtab?

Perhaps it wasn't clear from what I wrote, but I certainly don't want
to abandon it completely - I just want to revert to a safe state until
we figure out how to do the RCU policy reload safely. The solution with
two-way conversion seems doable (rough sketch below), it's just not a
quick and easy fix.

--
Ondrej Mosnacek
Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.
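(For illustration only - a sketch of the two-way idea in terms of the
toy model earlier in this message, not the eventual kernel patch; the
helper name toy_add_two_way() is made up, and the real thing would have
to take both sidtabs' locks in a fixed order and handle allocation
failures. The point is just that, while a conversion is in progress, an
entry added through either table is mirrored into the other one, so the
two tables can never hand out clashing SIDs:

/* Two-way variant of toy_add() from the toy model above (purely
 * illustrative, not the eventual patch): the SID is allocated past the
 * end of *both* tables and the entry is stored in both, so an entry
 * added via either table can never clash with one added via the other. */
static int toy_add_two_way(struct toy_sidtab *a, struct toy_sidtab *b,
			   const char *ctx)
{
	/* The real kernel would do this under both sidtabs' locks, taken
	 * in a fixed order; the toy model is single-threaded, so plain
	 * code is enough to show the bookkeeping. */
	int sid = a->count > b->count ? a->count : b->count;

	a->ctx[sid] = ctx;
	b->ctx[sid] = ctx;
	a->count = b->count = sid + 1;
	return sid;
}

Whether the real fix ends up looking like this is a separate question -
the sketch only shows why mirroring in both directions removes the
clash.)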