Hello, Joel.

> > > Hi Vlad,
> > >
> > > One concern I have is this moves the problem a bit further down. My belief is
> > > we should avoid the likelihood of even needing an rcu_head allocated for the
> > > headless case, to begin with - than trying to do damage-control when it does
> > > happen. The only way we would end up needing an rcu_head is if we could not
> > > allocate an array.
> > >
> > Let me share my view on all such caching. I think that now it becomes less as
> > the issue, because of we have now https://lkml.org/lkml/2020/4/2/383 patch.
> > I see that it does help a lot. I tried to simulate low memory condition and
> > apply high memory pressure with that. I did not manage to trigger the
> > "synchronize rcu" path at all. It is because of using much more permissive
> > parameters when we request a memory from the SLAB(direct reclaim, etc...).
>
> That's a good sign that we don't hit this path in your tests.
>
Just one request, of course, if you have time :) Could you please stress the
system on your test environment and double check whether you also cannot hit it?

Here is how I test it. Please apply the patch below:

<snip>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5e26145e9ead..25f7ac8583e1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3203,6 +3203,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 
 	if (head) {
 		ptr = (void *) head - (unsigned long) func;
+		head = NULL;
 	} else {
 		/*
 		 * Please note there is a limitation for the head-less
@@ -3233,16 +3234,18 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 	 * Under high memory pressure GFP_NOWAIT can fail,
 	 * in that case the emergency path is maintained.
 	 */
-	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
-	if (!success) {
+	/* success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); */
+	/* if (!success) { */
 		/* Is headless object? */
 		if (head == NULL) {
 			/* Drop the lock. */
 			krc_this_cpu_unlock(krcp, flags);
 
 			head = attach_rcu_head_to_object(ptr);
-			if (head == NULL)
+			if (head == NULL) {
+				success = false;
 				goto inline_return;
+			}
 
 			/* Take it back. */
 			krcp = krc_this_cpu_lock(&flags);
@@ -3267,7 +3270,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		 */
 		expedited_drain = true;
 		success = true;
-	}
+	/* } */
 
 	WRITE_ONCE(krcp->count, krcp->count + 1);
 
@@ -3297,7 +3300,9 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		if (!rcu_kfree_nowarn)
 			WARN_ON_ONCE(1);
 		debug_rcu_head_unqueue(ptr);
-		synchronize_rcu();
+		/* synchronize_rcu(); */
+		printk(KERN_ERR "-> hit synchronize_rcu() path.\n");
+		trace_printk("-> hit synchronize_rcu() path.\n");
 		kvfree(ptr);
 	}
 }
<snip>

Then lower the memory size and run the kfree_rcu tests. It would be appreciated.

> I guess also, with your latest patch on releasing the lock to be in a
> non-atomic context, and then doing the allocation, it became even more
> permissive? If you drop that patch and tried, do you still not hit the
> synchronous path more often?
>
Yep. If I drop the patch, I can hit it.

> Could you also try introducing memory pressure by reducing your system's
> total memory and see how it behaves?
>
I simulated a low-memory condition by setting the system memory to 145MB; that
was the minimum the KVM system was capable of booting with. After that I used
the kfree_rcu tests to apply memory pressure.
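Just for illustration, the head-less case we are discussing boils down to
something like the untested sketch below. It is not code from any posted patch;
it assumes the single-argument kvfree_rcu() form from this series, and the
object deliberately has no rcu_head embedded in it:

<snip>
#include <linux/slab.h>
#include <linux/rcupdate.h>

/* No struct rcu_head inside - that is what makes the free "head-less". */
struct test_obj {
	unsigned long payload[16];
};

static void stress_headless_frees(unsigned long iters)
{
	unsigned long i;

	for (i = 0; i < iters; i++) {
		struct test_obj *p = kmalloc(sizeof(*p), GFP_KERNEL);

		if (!p)
			continue;

		/*
		 * Single-argument form: an rcu_head or a slot in a bulk
		 * page has to be obtained internally, so under memory
		 * pressure this is the call that can fall back to the
		 * synchronize_rcu() path instrumented above.
		 */
		kvfree_rcu(p);
	}
}
<snip>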
> > > So instead of adding a pool for rcu_head allocations, how do you feel about
> > > pre-allocation of the per-cpu cache array instead, which has the same effect
> > > as you are intending?
> > >
> > In the v2 i have a list of such objects. It is also per-CPU(it is scaled to CPUs),
> > but the difference is, those objects require much less memory, it is 8 + sizeof(struct
> > rcu_head) bytes comparing to one page. Therefore the memory footprint is lower.
>
> Yes, true. That is one drawback is it higher memory usage. But if you have at
> least 1 kfree_rcu() request an each CPU, then pre-allocation does not
> increase memory usage any more than it already has right now. Unless, we
> extend my proposal to cache more than 2 pages per-cpu which I think you
> mentioned below.
>
If we cache two pages per CPU, I think that is fine. Increasing it beyond that
can be a bit wasteful, though; consider a 128-CPU system, for example.

> > I have doubts that we would ever hit this emergency list, because of mentioned
> > above patch, but from the other hand i can not say and guarantee 100%. Just in
> > case, we may keep it.
>
> You really have to hit OOM in your tests to trigger it I suppose. Basically
> the emergency pool improves situation under OOM, but otherwise does not
> improve it due to the direct-reclaim that happens as you mentioned. Right?
>
See above for how I simulated it. Direct reclaim is what GFP_KERNEL plus the
other flags gives us; if there is still no memory even after the reclaim
process, then the emergency list is supposed to be used.

But we can drop this patch, I mean the "emergency list", if we agree on it.
It would be good if you could stress your system the way I did; see the
description above :)

> > Paul, could you please share your view and opinion? It would be appreciated :)
> >
> > > This has 3 benefits:
> > > 1. It scales with number of CPUs, no configuration needed.
> > > 2. It makes the first kfree_rcu() faster and less dependent on an allocation
> > >    succeeding.
> > > 3. Much simpler code, no new structures or special handling.
> > > 4. In the future we can extend it to allocate more than 2 pages per CPU using
> > >    the same caching mechanism.
> > >
> > > The obvious drawback being its 2 pages per CPU but at least it scales by
> > > number of CPUs. Something like the following (just lightly tested):
> > >
> > > ---8<-----------------------
> > >
> > > From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
> > > Subject: [PATCH] rcu/tree: Preallocate the per-cpu cache for kfree_rcu()
> > >
> > > In recent changes, we have made it possible to use kfree_rcu() without
> > > embedding an rcu_head in the object being free'd. This requires dynamic
> > > allocation. In case dynamic allocation fails due to memory pressure, we
> > > would end up synchronously waiting for an RCU grace period thus hurting
> > > kfree_rcu() latency.
> > >
> > > To make this less probable, let us pre-allocate the per-cpu cache so we
> > > depend less on the dynamic allocation succeeding. This also has the
> > > effect of making kfree_rcu() slightly faster at run time.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > > ---
> > >  kernel/rcu/tree.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 6172e6296dd7d..9fbdeb4048425 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -4251,6 +4251,11 @@ static void __init kfree_rcu_batch_init(void)
> > >  			krcp->krw_arr[i].krcp = krcp;
> > >  		}
> > >
> > > +		krcp->bkvcache[0] = (struct kvfree_rcu_bulk_data *)
> > > +			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > +		krcp->bkvcache[1] = (struct kvfree_rcu_bulk_data *)
> > > +			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > +
> > >  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > >  		krcp->initialized = true;
> > >  	}
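Before commenting on it, here is an untested sketch of what point 4 of your
list could look like on top of this hunk. It is only an illustration, not a
real patch; the bkvcache[] name is taken from your diff, and a failed
allocation simply leaves the slot NULL for the normal run-time refill path:

<snip>
/*
 * Sketch only: fill every slot of the per-CPU cache at boot instead of
 * hard-coding two __get_free_page() calls.  A NULL slot is not treated
 * as an error here; it just means the run-time refill path is relied
 * upon later.
 */
static void __init kfree_rcu_prefill_cache(struct kfree_rcu_cpu *krcp)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(krcp->bkvcache); i++)
		krcp->bkvcache[i] = (struct kvfree_rcu_bulk_data *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}
<snip>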
>
> > We pre-allocate it, but differently comparing with your proposal :) I do not see
> > how it can improve things. The difference is you do it during initializing or booting
> > phase. In case of current code it will pre-allocate and cache one page after first
> > calling of the kvfree_call_rcu(), say in one second. So basically both variants are
> > the same.
>
> Well, one proposal is only 5 lines extra ;-). That has got to be at least a
> bit appealing ;-) ;-).
>
:)

> > But i think that we should allow to be used two pages as cached ones, no matter
> > whether it is vmalloc ptrs. or SLAB ones. So basically, two cached pages can be
> > used by vmalloc path and SLAB path. And probably it makes sense because of two
> > phases: one is when we collect pointers, second one is memory reclaim path. Thus
> > one page per one phase, i.e. it would be paired.
>
> You are saying this with regard to my proposal right? I agree number of
> pages could be increased. The caching mechanism already in-place could be
> starting point for that extension.
>
We already have two pages. What we need is to allow them to be used in both
paths, the vmalloc one and the SLAB one, I mean the reclaim path. At least
that fits well with the collecting/reclaim phases.

Thanks for the comments!

--
Vlad Rezki
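P.S. A rough, untested illustration of the "one cached page per phase" idea,
i.e. letting the pointer-collecting side and the reclaim side share the same
two-slot per-CPU cache. This is not code from any posted patch: the field
names follow the bkvcache[] array from your diff, locking is left out, and
the call sites on both sides are only assumed:

<snip>
/*
 * Illustration only: a tiny get/put pair over the two-slot per-CPU page
 * cache.  The collecting side (vmalloc or SLAB bulk arrays) would take a
 * page via the "get" helper when it needs one, and the reclaim side
 * would hand pages back via the "put" helper after a grace period,
 * instead of freeing them right away.
 */
static void *krc_cache_get_page(struct kfree_rcu_cpu *krcp)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(krcp->bkvcache); i++) {
		if (krcp->bkvcache[i]) {
			void *page = krcp->bkvcache[i];

			krcp->bkvcache[i] = NULL;
			return page;
		}
	}

	/* Cache is empty: fall back to the page allocator. */
	return (void *) __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}

static void krc_cache_put_page(struct kfree_rcu_cpu *krcp, void *page)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(krcp->bkvcache); i++) {
		if (!krcp->bkvcache[i]) {
			krcp->bkvcache[i] = page;
			return;
		}
	}

	/* Both slots are already cached: really free the page. */
	free_page((unsigned long) page);
}
<snip>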