On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote:
> > On Fri, Aug 14 2020 at 00:06, peterz wrote:
> > > I'm still not getting it, how do we end up trying to allocate memory from under raw spinlocks if you're not allowed to use kfree_rcu() under one?
> > >
> > > Can someone please spell out the actual problem?
> >
> > Your actual problem is the heat wave. Get an icepack already :)
>
> Sure, but also much of the below wasn't stated anywhere in the thread I got Cc'ed on. All I got was half-arsed solutions without an actual problem statement.
>
> > To set things straight:
> >
> > 1) kfree_rcu(), which used to be just a convenience wrapper around call_rcu(), always worked in any context except NMI.
> >
> >    Otherwise RT would have never worked or would have needed other horrible hacks to defer kfree in certain contexts, including context switch.
>
> I've never used kfree_rcu(), htf would I know.

Doing this to kfree_rcu() is the first step. We will also be doing this to call_rcu(), which has some long-standing invocations from various raw contexts, including hardirq handlers.

> > 2) RCU grew an optimization for kfree_rcu() which avoids the linked-list cache misses when a large number of objects is freed via kfree_rcu(). Paul says it speeds up post-grace-period processing by a factor of 8, which is impressive.
> >
> >    That's done by avoiding the linked list and storing the object pointers in an array of pointers which is page sized. This array is then freed in bulk without having to touch the rcu_head linked list, which obviously avoids cache misses.
> >
> >    Occasionally kfree_rcu() needs to allocate a page for this, but it can fall back to the linked list when the allocation fails.
> >
> >    Inconveniently this allocation happens with an RCU-internal raw lock held, but even if we could do the allocation outside the RCU-internal lock, this would not solve the problem that kfree_rcu() can be called in any context except NMI, even on RT.
> >
> >    So RT forbids allocations from truly atomic contexts even with GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because everything which calls it is in non-atomic context :) But still GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go down into deeper levels or wait for pages to become available.
>
> Right, so on RT just cut out the allocation and unconditionally make it NULL. Since it needs to deal with GFP_ATOMIC|GFP_NOWARN failing the allocation anyway.

Except that this is not just RT, due to CONFIG_PROVE_RAW_LOCK_NESTING=y.

> > 3) #2 upsets lockdep (with the raw lock nesting enabled), rightfully so, when RCU tries to allocate a page, the lockless page cache is empty, and it acquires zone->lock, which is a regular spinlock.
>
> There was no lockdep splat in the thread.

I don't see the point of including such a splat given that we know that lockdep is supposed to splat in this situation.

> > 4) As kfree_rcu() can be used from any context except NMI and RT relies on it, we ran into a circular dependency problem.
>
> Well, which actual usage sites are under a raw spinlock? Most of the ones I could find are not.

There are some on their way in, but this same optimization will be applied to call_rcu(), and there is no shortage of such call sites in that case. And these call sites have been around for a very long time.
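To make the problem statement a bit more concrete, below is a minimal sketch of the sort of call site I have in mind. The names are invented for illustration rather than taken from any actual in-tree user, but the shape is real: the callback is queued while a raw_spinlock_t is held, possibly from a hardirq handler, so the queuing path must not acquire a regular spinlock such as zone->lock, whether directly or through the allocator.

	/* Illustrative sketch only -- invented names, not an in-tree call site. */
	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct foo {
		struct rcu_head rh;
		int data;
	};

	static DEFINE_RAW_SPINLOCK(foo_lock);	/* raw: acquired from hardirq */
	static struct foo __rcu *foo_cur;

	static void foo_reclaim(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct foo, rh));
	}

	/* Invoked only from hardirq handlers, so plain raw_spin_lock() suffices. */
	static void foo_replace(struct foo *newp)
	{
		struct foo *oldp;

		raw_spin_lock(&foo_lock);
		oldp = rcu_dereference_protected(foo_cur,
						 lockdep_is_held(&foo_lock));
		rcu_assign_pointer(foo_cur, newp);
		if (oldp)
			call_rcu(&oldp->rh, foo_reclaim); /* queued under a raw lock */
		raw_spin_unlock(&foo_lock);
	}

Once the pointer-array optimization is applied to call_rcu(), every such call site inherits the requirement that the page allocation either avoid zone->lock or fail cleanly.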
> > If we disable that feature for RT then the problem would be solved, except that lockdep still would complain about taking zone->lock within a forbidden context on !RT kernels.
> >
> > Preventing that feature because of that is not a feasible option either. Aside of that, we discuss this post factum, IOW the stuff is upstream already despite the problem being known before.
>
> Well, that's a fail :-( I thought we were supposed to solve known issues before shit got merged.

The enforcement is coming in and the kfree_rcu() speedup is coming in at the same time. The call_rcu() speedup will appear later.

> > 5) Ulad posted patches which introduce a new GFP allocation mode which makes the allocation fail if the lockless cache is empty, i.e. it does not try to touch zone->lock in that case.
> >
> >    That works perfectly fine on RT and !RT, makes lockdep happy, and Paul is happy as well.
> >
> >    If the lockless cache, which works perfectly fine on RT, is empty, then the performance of kfree_rcu() post-grace-period processing is probably not making the situation of the system worse.
> >
> >    Michal is not fond of the new GFP flag and wants to explore options to avoid that completely. But so far there is no real feasible solution.
>
> Not only Michal, those patches looked pretty horrid.

They are the first round, and will be improved.

> > A) It was suggested to make zone->lock raw, but that's a total disaster latency-wise. With just a kernel compile (!RT kernel), spinwait times are in the hundreds of microseconds.
>
> Yeah, I know, been there done that, like over a decade ago :-)
>
> > B) Michal suggested to have GFP_RT_ATOMIC defined to 0, which would not require a new GFP bit and bail out when RT is enabled and the context is atomic.
>
> I would've written the code like:
>
> 	void *mem = NULL;
>
> 	...
> #ifndef CONFIG_PREEMPT_RT
> 	mem = kmalloc(PAGE_SIZE, GFP_ATOMIC|GFP_NOWAIT);
> #endif
> 	if (!mem)
> 		....
>
> Seems much simpler and doesn't pollute the GFP_ space.

But that fails in the CONFIG_PROVE_RAW_LOCK_NESTING=y case when lockdep is enabled.
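To spell out what that constraint does to the code, here is a rough sketch of the shape of the kfree_rcu() queuing path. It is greatly simplified and uses invented names rather than the actual in-kernel implementation, but it shows the only property we need from the allocator: either hand back a page without touching zone->lock, or fail immediately so that we fall back to the old rcu_head list.

	/* Greatly simplified sketch with invented names, not the real RCU code. */
	#include <linux/gfp.h>
	#include <linux/kernel.h>
	#include <linux/mm.h>

	struct kfree_bulk_page {
		int nr;						/* number of slots used */
		void *ptrs[PAGE_SIZE / sizeof(void *) - 1];	/* objects to kfree() in bulk */
	};

	/* Returns true if @obj was queued for bulk freeing, false to use the rcu_head list. */
	static bool queue_for_bulk_free(struct kfree_bulk_page **pagep, void *obj)
	{
		struct kfree_bulk_page *p = *pagep;

		if (!p) {
			/*
			 * We may be under a raw spinlock here.  GFP_NOWAIT avoids
			 * sleeping, but it can still take zone->lock when the
			 * per-CPU lists are empty, which is exactly what lockdep
			 * complains about with CONFIG_PROVE_RAW_LOCK_NESTING=y.
			 */
			p = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
			if (!p)
				return false;	/* fall back to the rcu_head list */
			p->nr = 0;
			*pagep = p;
		}
		if (p->nr >= ARRAY_SIZE(p->ptrs))
			return false;		/* page full, fall back for this object */
		p->ptrs[p->nr++] = obj;
		return true;
	}

Ulad's patches provide essentially that: an allocation mode that fails instead of touching zone->lock when the lockless cache is empty.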
> > D) To solve the lockdep issue of #B, Michal suggested to have magic lockdep annotations which allow !RT kernels to take zone->lock from the otherwise forbidden contexts, because on !RT this lock nesting could be considered a false positive.
> >
> >    But this would be horrors of some sort, because there might be locks further down which then need the same treatment, or some general 'yay, I'm exempt from everything' annotation, which is short of disabling lockdep completely.
> >
> >    Even if that could be solved and made correct for both RT and !RT, this opens the can of worms that this kind of annotation would show up all over the place within no time, for the completely wrong reasons.
>
> So due to this heat crap I've not slept more than a few hours a night for the last week, I'm not entirely there, but we already have a bunch of lockdep magic for this raw nesting stuff.

Ouch! Here is hoping for cooler weather soon!

> But yes, we need to avoid abuse, grep for lockdep_off() :-( This drivers/md/dm-cache-target.c thing is just plain broken. It sorta 'works' on mainline (and even there it can behave badly), but on RT it will come apart with a bang.
>
> > Paul compiled this options list:
> >
> > > 1. Prohibit invoking allocators from raw atomic context, such as when holding a raw spinlock.
> >
> > Clearly the simplest solution, but not Paul's favourite, and unfortunately he has a good reason.
>
> Which isn't actually stated anywhere I suppose?

Several times, but why not one more? ;-) In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which the proposed kfree_rcu() and later call_rcu() optimizations are quite important, there is no way to tell at runtime whether or not you are in raw atomic context.

> > > 2. Adding a GFP_ flag.
> >
> > Michal does not like the restriction for !RT kernels and tries to avoid the introduction of a new allocation mode.
>
> Like above, I tend to be with Michal on this, just wrap the actual allocation in CONFIG_PREEMPT_RT, the code needs to handle a NULL pointer there anyway.

That disables the optimization in the CONFIG_PREEMPT_NONE=y case, where it really is needed.

> > > 3. Reusing existing GFP_ flags/values/whatever to communicate the raw-context information that was to be communicated by the new GFP_ flag.
> > >
> > > 4. Making lockdep forgive acquiring spinlocks while holding raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
>
> Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'.

I would be OK with either. In CONFIG_PREEMPT_NONE=n kernels, the kfree_rcu() code could use preemptible() to determine whether it was safe to invoke the allocator. The code in kfree_rcu() might look like this:

	mem = NULL;
	if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible())
		mem = __get_free_page(...);

Is your point that the usual mistakes would then be caught by the usual testing on CONFIG_PREEMPT_NONE=n kernels?

> > These are not separate from each other, as #3 requires #4. The most horrible solution IMO from a technical POV, as it proliferates inconsistency for no good reason.
> >
> > Aside of that, it'd be solving a problem which does not exist, simply because kfree_rcu() does not depend on it, and we really don't want to set a precedent and encourage the (ab)use of this in any way.
>
> My preferred solution is 1: if you want to use an allocator, you simply don't get to play under raw_spinlock_t. And from a quick grep, most kfree_rcu() users are not under raw_spinlock_t context.

There is at least one on its way in, but more to the point, we will need to apply this same optimization to call_rcu(), which is used in raw atomic context, including from hardirq handlers.

> This here sounds like someone who wants to have his cake and eat it too.

Just looking for a non-negative-sized slice of cake, actually!

> I'll try and think about a lockdep annotation that isn't utter crap, but that's probably next week, I need this heat to end and get a few nights' sleep, I'm an utter wreck atm.

Again, here is hoping for cooler weather!

							Thanx, Paul