Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Tue, 11 Aug 2020 14:09:31 -0700

On Tue, Aug 11, 2020 at 09:39:10PM +0200, Thomas Gleixner wrote:
> Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:
> > "Paul E. McKenney" <paulmck@xxxxxxxxxx> writes:
> >> On Tue, Aug 11, 2020 at 04:44:21PM +0200, Thomas Gleixner wrote:
> >>> Now RCU creates a new thing which enforces to make page allocation in
> >>> atomic context possible on RT. What for?
> >>> 
> >>> What's the actual use case in truly atomic context for this new thing on
> >>> an RT kernel?
> >>
> >> It is not just RT kernels.  CONFIG_PROVE_RAW_LOCK_NESTING=y propagates
> >> this constraint to all configurations, and a patch in your new favorite
> >> subsystem really did trigger this lockdep check in a non-RT kernel.
> >>
> >>> The actual RCU code disabling interrupts is an implementation detail
> >>> which can easily be mitigated with a local lock.
> >>
> >> In this case, we are in raw-spinlock context on entry to kfree_rcu().
> >
> > Where?
> 
> And aside of the where, wasn't kfree_rcu() from within raw spinlock held
> regions possible all the time? Either I'm missing something or you are
> fundamentally changing RCU internals. kfree_rcu() saved RT in various
> ways where invoking kfree() was just not an option. Confused...

Back in the old days (months ago!), it was possible to invoke kfree_rcu()
while holding a raw spinlock because kfree_rcu() didn't acquire any locks.
It didn't need to because it was just a wrapper around call_rcu().
And call_rcu(), along with the rest of RCU's internals, has used raw
spinlocks since near the beginnings of RT, which meant that it wasn't
a problem in the rare cases where call_rcu() needed to acquire one of
RCU's locks (for example, when call_rcu() sees that the current CPU has
accumulated more than 10,000 callbacks).

But one problem with the old kfree_rcu() approach is that the memory to
be freed is almost always cold in the cache by the time that the grace
period ends.  And this was made worse by the fact that the rcu_do_batch()
function traverses the list of objects to be freed as a linked list,
thus incurring a cache miss on each and every object.  The usual fix
for this sort of performance problem is to use arrays of pointers,
which on a 64-bit system with 64-byte cache lines reduces the number of
cache misses by a factor of eight.  In addition, decreasing the number
of post-grace-period cache misses increases the stability of RCU with
respect to callback flooding:  Because the kfree()s happen faster, it
is harder to overrun RCU with tight loops posting callbacks (as happened
some time back with certain types of ACLs).

Hence Ulad's work on kfree_rcu().  The approach is to allocate a
page-sized array to hold all the pointers, then fill in the rest of these
pointers on each subsequent kfree_rcu() call.  These arrays of pointers
also allows use of kfree_bulk() instead of kfree(), which can improve
performance yet more.  It is no big deal if kfree_rcu()'s allocation
attempts fail occasionally because it can simply fall back to the old
linked-list approach.  And given that the various lockless caches in
the memory allocator are almost never empty, in theory life is good.

But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING,
and for good reason -- this Kconfig option makes it at least a
little bit harder for mainline developers to mess up RT.  But with
CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now
sometimes complain if you invoke kfree_rcu() while holding a raw spinlock.
This happens when kfree_rcu() needs to invoke the memory allocator and
the memory allocator's caches are empty, thus resulting in the memory
allocator attempting to acquire a non-raw spinlock.

Because kfree_rcu() has a fallback available (just using the old linked
list), kfree_rcu() would work well given a way to tell the memory
allocator to return NULL instead of acquiring a non-raw spinlock.
Which is exactly what Ulad's recent patches are intended to do.

Since then, this thread has discussed various other approaches,
including using existing combinations of GFP_ flags, converting
the allocator's zone lock to a raw spinlock, and so on.

Does that help, or am I missing the point of your question?

							Thanx, Paul