Hello, Joel.

> > > Hi Vlad,
> > >
> > > One concern I have is this moves the problem a bit further down. My belief is
> > > we should avoid the likelihood of even needing an rcu_head allocated for the
> > > headless case, to begin with - than trying to do damage-control when it does
> > > happen. The only way we would end up needing an rcu_head is if we could not
> > > allocate an array.
> > >
> > Let me share my view on all such caching. I think that now it becomes less as
> > the issue, because of we have now https://lkml.org/lkml/2020/4/2/383 patch.
> > I see that it does help a lot. I tried to simulate low memory condition and
> > apply high memory pressure with that. I did not manage to trigger the
> > "synchronize rcu" path at all. It is because of using much more permissive
> > parameters when we request a memory from the SLAB(direct reclaim, etc...).
>
> That's a good sign that we don't hit this path in your tests.
>
Just one request, of course, if you have time :) Could you please stress the
system on your test environment and double check whether you also cannot hit it?

Here is how I test it. Please apply the patch below:

<snip>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5e26145e9ead..25f7ac8583e1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3203,6 +3203,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 
 	if (head) {
 		ptr = (void *) head - (unsigned long) func;
+		head = NULL;
 	} else {
 		/*
 		 * Please note there is a limitation for the head-less
@@ -3233,16 +3234,18 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 	 * Under high memory pressure GFP_NOWAIT can fail,
 	 * in that case the emergency path is maintained.
 	 */
-	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
-	if (!success) {
+	/* success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); */
+	/* if (!success) { */
 		/* Is headless object? */
 		if (head == NULL) {
 			/* Drop the lock. */
 			krc_this_cpu_unlock(krcp, flags);
 
 			head = attach_rcu_head_to_object(ptr);
-			if (head == NULL)
+			if (head == NULL) {
+				success = false;
 				goto inline_return;
+			}
 
 			/* Take it back. */
 			krcp = krc_this_cpu_lock(&flags);
@@ -3267,7 +3270,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		 */
 		expedited_drain = true;
 		success = true;
-	}
+	/* } */
 
 	WRITE_ONCE(krcp->count, krcp->count + 1);
 
@@ -3297,7 +3300,9 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		if (!rcu_kfree_nowarn)
 			WARN_ON_ONCE(1);
 		debug_rcu_head_unqueue(ptr);
-		synchronize_rcu();
+		/* synchronize_rcu(); */
+		printk(KERN_ERR "-> hit synchronize_rcu() path.\n");
+		trace_printk("-> hit synchronize_rcu() path.\n");
 		kvfree(ptr);
 	}
 }
<snip>

Then lower the memory size and run the kfree_rcu tests. It would be appreciated.

> I guess also, with your latest patch on releasing the lock to be in a
> non-atomic context, and then doing the allocation, it became even more
> permissive? If you drop that patch and tried, do you still not hit the
> synchronous path more often?
>
Yep. If I drop the patch, I can hit it.

> Could you also try introducing memory pressure by reducing your system's
> total memory and see how it behaves?
>
I simulated a low-memory condition by setting the system memory to 145MB; that
was the minimum the KVM system was capable of booting with. After that I used
the kfree_rcu tests to apply memory pressure.
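Just for illustration, the head-less case we are discussing boils down to
something like the untested sketch below. It is not code from any posted patch;
it assumes the single-argument kvfree_rcu() form from this series, and the
object deliberately has no rcu_head embedded in it:

<snip>
#include <linux/slab.h>
#include <linux/rcupdate.h>

/* No struct rcu_head inside - that is what makes the free "head-less". */
struct test_obj {
	unsigned long payload[16];
};

static void stress_headless_frees(unsigned long iters)
{
	unsigned long i;

	for (i = 0; i < iters; i++) {
		struct test_obj *p = kmalloc(sizeof(*p), GFP_KERNEL);

		if (!p)
			continue;

		/*
		 * Single-argument form: an rcu_head or a slot in a bulk
		 * page has to be obtained internally, so under memory
		 * pressure this is the call that can fall back to the
		 * synchronize_rcu() path instrumented above.
		 */
		kvfree_rcu(p);
	}
}
<snip>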
> > > So instead of adding a pool for rcu_head allocations, how do you feel about
> > > pre-allocation of the per-cpu cache array instead, which has the same effect
> > > as you are intending?
> > >
> > In the v2 i have a list of such objects. It is also per-CPU(it is scaled to CPUs),
> > but the difference is, those objects require much less memory, it is 8 + sizeof(struct
> > rcu_head) bytes comparing to one page. Therefore the memory footprint is lower.
>
> Yes, true. That is one drawback is it higher memory usage. But if you have at
> least 1 kfree_rcu() request an each CPU, then pre-allocation does not
> increase memory usage any more than it already has right now. Unless, we
> extend my proposal to cache more than 2 pages per-cpu which I think you
> mentioned below.
>
If we cache two pages per CPU, I think that is fine. Increasing it beyond that
can be a bit wasteful, though; consider a 128-CPU system, for example.

> > I have doubts that we would ever hit this emergency list, because of mentioned
> > above patch, but from the other hand i can not say and guarantee 100%. Just in
> > case, we may keep it.
>
> You really have to hit OOM in your tests to trigger it I suppose. Basically
> the emergency pool improves situation under OOM, but otherwise does not
> improve it due to the direct-reclaim that happens as you mentioned. Right?
>
See above for how I simulated it. Direct reclaim is what GFP_KERNEL plus the
other flags gives us; if there is still no memory even after the reclaim
process, then the emergency list is supposed to be used.

But we can drop this patch, I mean the "emergency list", if we agree on it.
It would be good if you could stress your system the way I did; see the
description above :)

> > Paul, could you please share your view and opinion? It would be appreciated :)
> >
> > > This has 3 benefits:
> > > 1. It scales with number of CPUs, no configuration needed.
> > > 2. It makes the first kfree_rcu() faster and less dependent on an allocation
> > >    succeeding.
> > > 3. Much simpler code, no new structures or special handling.
> > > 4. In the future we can extend it to allocate more than 2 pages per CPU using
> > >    the same caching mechanism.
> > >
> > > The obvious drawback being its 2 pages per CPU but at least it scales by
> > > number of CPUs. Something like the following (just lightly tested):
> > >
> > > ---8<-----------------------
> > >
> > > From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
> > > Subject: [PATCH] rcu/tree: Preallocate the per-cpu cache for kfree_rcu()
> > >
> > > In recent changes, we have made it possible to use kfree_rcu() without
> > > embedding an rcu_head in the object being free'd. This requires dynamic
> > > allocation. In case dynamic allocation fails due to memory pressure, we
> > > would end up synchronously waiting for an RCU grace period thus hurting
> > > kfree_rcu() latency.
> > >
> > > To make this less probable, let us pre-allocate the per-cpu cache so we
> > > depend less on the dynamic allocation succeeding. This also has the
> > > effect of making kfree_rcu() slightly faster at run time.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > > ---
> > >  kernel/rcu/tree.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 6172e6296dd7d..9fbdeb4048425 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -4251,6 +4251,11 @@ static void __init kfree_rcu_batch_init(void)
> > >  			krcp->krw_arr[i].krcp = krcp;
> > >  		}
> > >
> > > +		krcp->bkvcache[0] = (struct kvfree_rcu_bulk_data *)
> > > +			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > +		krcp->bkvcache[1] = (struct kvfree_rcu_bulk_data *)
> > > +			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > +
> > >  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > >  		krcp->initialized = true;
> > >  	}
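Before commenting on it, here is an untested sketch of what point 4 of your
list could look like on top of this hunk. It is only an illustration, not a
real patch; the bkvcache[] name is taken from your diff, and a failed
allocation simply leaves the slot NULL for the normal run-time refill path:

<snip>
/*
 * Sketch only: fill every slot of the per-CPU cache at boot instead of
 * hard-coding two __get_free_page() calls.  A NULL slot is not treated
 * as an error here; it just means the run-time refill path is relied
 * upon later.
 */
static void __init kfree_rcu_prefill_cache(struct kfree_rcu_cpu *krcp)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(krcp->bkvcache); i++)
		krcp->bkvcache[i] = (struct kvfree_rcu_bulk_data *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}
<snip>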
>
> > We pre-allocate it, but differently comparing with your proposal :) I do not see
> > how it can improve things. The difference is you do it during initializing or booting
> > phase. In case of current code it will pre-allocate and cache one page after first
> > calling of the kvfree_call_rcu(), say in one second. So basically both variants are
> > the same.
>
> Well, one proposal is only 5 lines extra ;-). That has got to be at least a
> bit appealing ;-) ;-).
>
:)

> > But i think that we should allow to be used two pages as cached ones, no matter
> > whether it is vmalloc ptrs. or SLAB ones. So basically, two cached pages can be
> > used by vmalloc path and SLAB path. And probably it makes sense because of two
> > phases: one is when we collect pointers, second one is memory reclaim path. Thus
> > one page per one phase, i.e. it would be paired.
>
> You are saying this with regard to my proposal right? I agree number of
> pages could be increased. The caching mechanism already in-place could be
> starting point for that extension.
>
We already have two pages. What we need is to allow them to be used in both
paths, the vmalloc one and the SLAB one, I mean the reclaim path. At least
that fits well with the collecting/reclaim phases.

Thanks for the comments!

--
Vlad Rezki
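P.S. A rough, untested illustration of the "one cached page per phase" idea,
i.e. letting the pointer-collecting side and the reclaim side share the same
two-slot per-CPU cache. This is not code from any posted patch: the field
names follow the bkvcache[] array from your diff, locking is left out, and
the call sites on both sides are only assumed:

<snip>
/*
 * Illustration only: a tiny get/put pair over the two-slot per-CPU page
 * cache.  The collecting side (vmalloc or SLAB bulk arrays) would take a
 * page via the "get" helper when it needs one, and the reclaim side
 * would hand pages back via the "put" helper after a grace period,
 * instead of freeing them right away.
 */
static void *krc_cache_get_page(struct kfree_rcu_cpu *krcp)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(krcp->bkvcache); i++) {
		if (krcp->bkvcache[i]) {
			void *page = krcp->bkvcache[i];

			krcp->bkvcache[i] = NULL;
			return page;
		}
	}

	/* Cache is empty: fall back to the page allocator. */
	return (void *) __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}

static void krc_cache_put_page(struct kfree_rcu_cpu *krcp, void *page)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(krcp->bkvcache); i++) {
		if (!krcp->bkvcache[i]) {
			krcp->bkvcache[i] = page;
			return;
		}
	}

	/* Both slots are already cached: really free the page. */
	free_page((unsigned long) page);
}
<snip>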