On Thu, Nov 7, 2019 at 5:44 PM Will Deacon <will@xxxxxxxxxx> wrote:
> On Thu, Nov 07, 2019 at 05:12:53PM +0100, Ondrej Mosnacek wrote:
> > On Thu, Nov 7, 2019 at 11:17 AM Jeff Vander Stoep <jeffv@xxxxxxxxxx> wrote:
> > > This replaces the reverse table lookup and reverse cache with a
> > > hashtable which improves cache-miss reverse-lookup times from
> > > O(n) to O(1)* and maintains the same performance as a reverse
> > > cache hit.
> > >
> > > This reduces the time needed to add a new sidtab entry from ~500us
> > > to 5us on a Pixel 3 when there are ~10,000 sidtab entries.
> > >
> > > The implementation uses the kernel's generic hashtable API. It
> > > uses the context's string representation as the hash source, and
> > > the kernel's generic string hashing algorithm full_name_hash()
> > > to reduce the string to a 32-bit value.
> > >
> > > This change also maintains the improvement introduced in
> > > commit ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve
> > > performance") which removed the need to keep the current sidtab
> > > locked during policy reload. It does, however, introduce periodic
> > > locking of the target sidtab while converting the hashtable. Sidtab
> > > entries are never modified or removed, so the context struct stored
> > > in the sid_to_context tree can also be used for the context_to_sid
> > > hashtable to reduce memory usage.
> > >
> > > This bug was reported by:
> > > - On the selinux bug tracker.
> > >   BUG: kernel softlockup due to too many SIDs/contexts #37
> > >   https://github.com/SELinuxProject/selinux-kernel/issues/37
> > > - Jovana Knezevic on Android's bugtracker.
> > >   Bug: 140252993
> > >   "During multi-user performance testing, we create and remove users
> > >   many times. selinux_android_restorecon_pkgdir goes from 1ms to over
> > >   20ms after about 200 user creations and removals. Accumulated over
> > >   ~280 packages, that adds a significant time to user creation,
> > >   making perf benchmarks unreliable."
> > >
> > > * Hashtable lookup is only O(1) when n < the number of buckets.
> > >
> > > Changes in V2:
> > > - The hashtable uses sidtab_entry_leaf objects directly so these
> > >   objects are shared between the sid_to_context lookup tree and the
> > >   context_to_sid hashtable. This simplifies memory allocation and
> > >   was suggested by Ondrej Mosnacek.
> > > - The new sidtab hash stats file in selinuxfs has been moved out of
> > >   the avc dir and into a new "ss" dir.
> > >
> > > V3:
> > > - Add lock nesting notation.
> > >
> > > V4:
> > > - Moved to *_rcu variants of the various hashtable functions
> > >   as suggested by Ondrej Mosnacek and Will Deacon.
> >
> > I may be misunderstanding the purpose of RCU, but isn't all this RCU
> > stuff a big overkill when these entries are never removed? (Well, they
> > are removed when the sidtab is being destroyed, at which point, however,
> > no other threads are accessing them any more.) In my understanding,
> > RCU basically makes sure that objects that you dereference in an RCU
> > critical section are not destroyed until you leave it. Yes, it also
> > helps you to dereference/update them safely, but that can be achieved
> > on its own in a simpler way. Instead of using RCU here, I would
> > instead suggest looking into adding an equivalent of
> > list_for_each_entry_lockless()* for hlist and maybe introduce some
> > suitable hlist_add_something() function that ensures consistency
> > w.r.t. the lockless traversal (perhaps the WRITE_ONCE() in hlist_add()
> > is already sufficient, but I'm not sure...).
>
> If you use the existing _rcu accessors you will get correctly enforced
> dependency ordering on the reader side and correctly placed release
> ordering on the updater side. I don't think that's a big overkill, and
> you can use the RCU accessors to achieve the lockless traversal.

OK, but you don't need the read_lock()/unlock(), the rcu_head field in
the entries, and kfree_rcu(), right?
>
> hlist_add() is not safe against a concurrent traversal because the
> WRITE_ONCE() provides no memory ordering guarantees, allowing the readers
> to see an uninitialised node.

Right, so we would need a new function that does smp_store_release()
instead.

>
> How exactly would list_for_each_entry_lockless() and hlist_add_something()
> differ from the RCU variants, assuming they're implemented correctly?

Looking at the implementation of rcu_dereference() and
rcu_assign_pointer(), they would probably be practically the same,
except for the rcu_read_lock_held() check in rcu_dereference(). That,
and they would clearly communicate that they are not doing actual RCU,
but just allowing lockless traversal of an add-only hlist.

--
Ondrej Mosnacek <omosnace at redhat dot com>
Software Engineer, Security Technologies
Red Hat, Inc.