On Wed, Aug 28, 2019 at 07:52:53AM +0800, Boqun Feng wrote: > Hi Joel, > > On Tue, Aug 27, 2019 at 03:01:56PM -0400, Joel Fernandes (Google) wrote: > > During testing, it was observed that amount of memory consumed due > > kfree_rcu() batching is 300-400MB. Previously we had only a single > > head_free pointer pointing to the list of rcu_head(s) that are to be > > freed after a grace period. Until this list is drained, we cannot queue > > any more objects on it since such objects may not be ready to be > > reclaimed when the worker thread eventually gets to drainin g the > > head_free list. > > > > We can do better by maintaining multiple lists as done by this patch. > > Testing shows that memory consumption came down by around 100-150MB with > > just adding another list. Adding more than 1 additional list did not > > show any improvement. > > > > Suggested-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx> > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx> > > --- > > kernel/rcu/tree.c | 64 +++++++++++++++++++++++++++++++++-------------- > > 1 file changed, 45 insertions(+), 19 deletions(-) > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index 4f7c3096d786..9b9ae4db1c2d 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -2688,28 +2688,38 @@ EXPORT_SYMBOL_GPL(call_rcu); > > > > /* Maximum number of jiffies to wait before draining a batch. */ > > #define KFREE_DRAIN_JIFFIES (HZ / 50) > > +#define KFREE_N_BATCHES 2 > > + > > +struct kfree_rcu_work { > > + /* The rcu_work node for queuing work with queue_rcu_work(). The work > > + * is done after a grace period. > > + */ > > + struct rcu_work rcu_work; > > + > > + /* The list of objects that have now left ->head and are queued for > > + * freeing after a grace period. > > + */ > > + struct rcu_head *head_free; > > + > > + struct kfree_rcu_cpu *krcp; > > +}; > > +static DEFINE_PER_CPU(__typeof__(struct kfree_rcu_work)[KFREE_N_BATCHES], krw); > > > > Why not > > static DEFINE_PER_CPU(struct kfree_rcu_work[KFREE_N_BATCHES], krw); > > here? Am I missing something? Yes, that's better. > Further, given "struct kfree_rcu_cpu" is only for defining percpu > variables, how about orginazing the data structure like: > > struct kfree_rcu_cpu { > ... > struct kfree_rcu_work krws[KFREE_N_BATCHES]; > ... > } > > This could save one pointer in kfree_rcu_cpu, and I think it provides > better cache locality for accessing _cpu and _work on the same cpu. > > Thoughts? Yes, that's better. Thanks, Boqun! Following is the diff which I will fold into this patch: ---8<----------------------- diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b3259306b7a5..fac5ae96d8b1 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2717,7 +2717,6 @@ struct kfree_rcu_work { struct kfree_rcu_cpu *krcp; }; -static DEFINE_PER_CPU(__typeof__(struct kfree_rcu_work)[KFREE_N_BATCHES], krw); /* * Maximum number of kfree(s) to batch, if this limit is hit then the batch of @@ -2731,7 +2730,7 @@ struct kfree_rcu_cpu { struct rcu_head *head; /* Pointer to the per-cpu array of kfree_rcu_work structures */ - struct kfree_rcu_work *krwp; + struct kfree_rcu_work krw_arr[KFREE_N_BATCHES]; /* Protect concurrent access to this structure and kfree_rcu_work. */ spinlock_t lock; @@ -2800,8 +2799,8 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) lockdep_assert_held(&krcp->lock); while (i < KFREE_N_BATCHES) { - if (!krcp->krwp[i].head_free) { - krwp = &(krcp->krwp[i]); + if (!krcp->krw_arr[i].head_free) { + krwp = &(krcp->krw_arr[i]); break; } i++; @@ -3780,13 +3779,11 @@ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - struct kfree_rcu_work *krwp = &(per_cpu(krw, cpu)[0]); int i = KFREE_N_BATCHES; spin_lock_init(&krcp->lock); - krcp->krwp = krwp; while (i--) - krwp[i].krcp = krcp; + krcp->krw_arr[i].krcp = krcp; INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor); } }