On Mon, Apr 06, 2020 at 06:17:08PM +0200, Uladzislau Rezki wrote:
> On Mon, Apr 06, 2020 at 11:18:51AM -0400, Joel Fernandes wrote:
> > Hi Vlad,
> >
> > On Mon, Apr 06, 2020 at 02:56:40PM +0200, Uladzislau Rezki wrote:
> > > Hello, Joel.
> > >
> > > > > >
> > > > > > Hi Vlad,
> > > > > >
> > > > > > One concern I have is this moves the problem a bit further down. My belief is
> > > > > > we should avoid the likelihood of even needing an rcu_head allocated for the
> > > > > > headless case to begin with, rather than trying to do damage control when it
> > > > > > does happen. The only way we would end up needing an rcu_head is if we could
> > > > > > not allocate an array.
> > > > > >
> > > > > Let me share my view on all such caching. I think it has become less of an
> > > > > issue now, because we have the https://lkml.org/lkml/2020/4/2/383 patch.
> > > > > I see that it helps a lot. I tried to simulate a low-memory condition and
> > > > > apply high memory pressure with it. I did not manage to trigger the
> > > > > "synchronize rcu" path at all. That is because we now use much more
> > > > > permissive parameters when requesting memory from the SLAB (direct
> > > > > reclaim, etc.).
> > > >
> > > > That's a good sign that we don't hit this path in your tests.
> > > >
> > > Just one request, of course if you have the time :)
> > > Could you please double-check in your test environment, stressing the
> > > system, whether you also cannot hit it?
> >
> > Sure, I am planning to do so and happy to spend time on it :) One question I
> > had about the below test:
> >
> > > Here is how I test it. Please apply the patch below:
> > > <snip>
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 5e26145e9ead..25f7ac8583e1 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3203,6 +3203,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > >
> > >  	if (head) {
> > >  		ptr = (void *) head - (unsigned long) func;
> > > +		head = NULL;
> > >  	} else {
> > >  		/*
> > >  		 * Please note there is a limitation for the head-less
> > > @@ -3233,16 +3234,18 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > >  	 * Under high memory pressure GFP_NOWAIT can fail,
> > >  	 * in that case the emergency path is maintained.
> > >  	 */
> > > -	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> > > -	if (!success) {
> > > +	/* success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); */
> >
> > If you drop this, then it is not realistic, right? I mean, it changes the
> > behavior of the code completely. We need to try to allocate the array and
> > then try to allocate the head.
> >
> That just bypasses the allocation of the array, to simplify things and
> move forward toward the path we would like to test. Also, head is set to
> NULL to simulate headless freeing.

Makes sense. I know you are forcing the code to invoke the bad case more
often, but I was concerned the hack would change the dynamics of the code
enough to make it an unrealistic situation. But I see your point.

> > > +	/* if (!success) { */
> > >  		/* Is headless object? */
> > >  		if (head == NULL) {
> > >  			/* Drop the lock. */
> > >  			krc_this_cpu_unlock(krcp, flags);
> > >
> > >  			head = attach_rcu_head_to_object(ptr);
> > > -			if (head == NULL)
> > > +			if (head == NULL) {
> > > +				success = false;
> > >  				goto inline_return;
> > > +			}
> > >
> > >  			/* Take it back. */
> > >  			krcp = krc_this_cpu_lock(&flags);
> > > @@ -3267,7 +3270,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > >  		 */
> > >  		expedited_drain = true;
> > >  		success = true;
> > > -	}
> > > +	/* } */
> > >
> > >  	WRITE_ONCE(krcp->count, krcp->count + 1);
> > >
> > > @@ -3297,7 +3300,9 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > >  		if (!rcu_kfree_nowarn)
> > >  			WARN_ON_ONCE(1);
> > >  		debug_rcu_head_unqueue(ptr);
> > > -		synchronize_rcu();
> > > +		/* synchronize_rcu(); */
> > > +		printk(KERN_ERR "-> hit synchronize_rcu() path.\n");
> > > +		trace_printk("-> hit synchronize_rcu() path.\n");
> > >  		kvfree(ptr);
> > >  	}
> > >  }
> > > <snip>
> > >
> > > Then lower the memory size and run the kfree_rcu tests. It would be
> > > appreciated.
> >
> > I am happy to try out the diff once I understand how it stays close enough
> > to the current code's behavior when we are not using the array. One other
> > issue with the current kfree_rcu tests is that the test itself is the source
> > of the pressure -- I believe we should also have some testing where the
> > memory pressure is caused elsewhere (such as a real user workload causing
> > OOM), and then we see how RCU behaves under OOM -- if we have too many
> > synchronous latencies, does the additional caching remove such latencies
> > under OOM? etc. I also want to look into your vmalloc tests.
> >
> Of course, having real tests would be good. Agreed.
>
> > > > I guess also, with your latest patch releasing the lock to be in a
> > > > non-atomic context, and then doing the allocation, it became even more
> > > > permissive? If you drop that patch and try, do you still not hit the
> > > > synchronous path more often?
> > > >
> > > Yep. If I drop the patch, I can hit it.
> >
> > Ah, cool. So basically the direct-reclaim path does the synchronous waiting,
> > instead of synchronize_rcu(). Either way, we wait synchronously. How do we
> > choose which way is better, though? If direct reclaim improves the memory
> > situation, then we should enter that path. But if direct reclaim takes too
> > much time (thus hurting the kfree_rcu() latency), then perhaps it is better
> > for kfree_rcu() to just do the synchronize_rcu() and let someone else enter
> > the direct-reclaim path. We should probably quantify and see which approach
> > works better.
> >
> The way I see it, the headless variant has to be called from a sleeping
> context, therefore it can sleep. Whether it is better to call
> synchronize_rcu() or to do direct reclaim depends on how many CPUs we have
> in the system. I suspect that doing direct reclaim is better; at least it
> will free some memory for us. We could also extend that patch and make it
> a bit different, for example: do NOWAIT first, then try ATOMIC, and as a
> last step do a GFP_KERNEL allocation.

Yes, that's a good idea. That way we perhaps reduce the chance that
kfree_rcu() enters direct reclaim. Let us do it that way. At least I don't
see any drawbacks in such an approach.

thanks,

- Joel
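
P.S. To make the NOWAIT -> ATOMIC -> GFP_KERNEL idea concrete, below is a
minimal sketch of roughly how such a tiered allocation could look. The
helper name and the exact flag combinations are illustrative assumptions
on my side, not the actual patch:

<snip>
/*
 * Illustrative sketch only: try progressively heavier GFP flags
 * when allocating an rcu_head for the headless path.
 */
static struct rcu_head *alloc_rcu_head_tiered(void)
{
	struct rcu_head *head;

	/* First attempt: no direct reclaim, no sleeping; cheap and fast. */
	head = kmalloc(sizeof(*head), GFP_NOWAIT | __GFP_NOWARN);
	if (head)
		return head;

	/* Second attempt: dip into atomic reserves, still without sleeping. */
	head = kmalloc(sizeof(*head), GFP_ATOMIC | __GFP_NOWARN);
	if (head)
		return head;

	/*
	 * Last resort: allow direct reclaim. This is only safe because
	 * the headless variant must be called from a sleepable context
	 * anyway.
	 */
	return kmalloc(sizeof(*head), GFP_KERNEL);
}
<snip>

If even the GFP_KERNEL attempt fails, the caller would still fall back to
the synchronize_rcu() emergency path, as in the current code.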