Hi,

On 6/8/2023 11:35 AM, Hou Tao wrote:
> Hi,
>
> On 6/8/2023 8:34 AM, Alexei Starovoitov wrote:
>> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
>>>> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> SNIP
>
> By comparing the implementations of v3 and v4, I found one hack which
> could reduce the memory usage of v4 (with per-cpu reusable list)
> significantly, making the memory usage similar between v3 and v4. If we
> queue an empty work before calling irq_work_raise() as shown below, the
> invocation latency of reuse_rcu() (a normal RCU callback) decreases from
> ~150ms to ~10ms. I think the reason is that the duration of the normal
> RCU grace period is reduced a lot, but I don't know why that happens.
> Hope to get some help from Paul. Because Paul doesn't have enough
> context, I will try to explain the context of the weird problem below.
> And Alexei, could you please also try the hack below for your
> multiple-rcu-cbs version?

An update on the queue_work() hack: it works for both CONFIG_PREEMPT=y
and CONFIG_PREEMPT=n. I will try to enable RCU tracing to check whether
or not there is any difference.

>
> Hi Paul,
>
> I just found out that the time between the call of call_rcu(..,
> reuse_rcu) and the invocation of the RCU callback (namely reuse_rcu())
> decreases a lot (from ~150ms to ~10ms) if I queue an empty kworker
> periodically as shown in the diff below. Before the diff below is
> applied, the benchmark process does the following things on a VM with
> 8 CPUs:
>
> 1) create a pthread htab_mem_producer on each CPU and pin the thread
>    to that specific CPU
> 2) htab_mem_producer calls syscall(__NR_getpgid) repeatedly in a
>    dead loop
> 3) the call of getpgid() triggers the invocation of a bpf program
>    attached to the getpgid() syscall
> 4) the bpf program overwrites 2048 elements in a bpf hash map
> 5) during the overwrite, it frees the existing element first
> 6) the free calls unit_free(), and unit_free() raises an irq-work in
>    batches once 96 elements have been freed
> 7) in the irq-work, it allocates a new struct to hold the freed
>    elements and the rcu_head, and does call_rcu(..., reuse_rcu)
> 8) in reuse_rcu() it just moves these freed elements onto a per-cpu
>    reuse list
> 9) after the free completes, the overwrite allocates a new element
> 10) the allocation may also raise an irq-work in batches once the
>     preallocated elements are exhausted
> 11) in the irq-work, it tries to fetch elements from the per-cpu reuse
>     list, and if the list is empty, it falls back to kmalloc()
>
> For the procedure described above, the latency between the call of
> call_rcu() and the invocation of reuse_rcu() is about ~150ms or larger.
> I have also checked the latency of syscall(__NR_getpgid) and it is
> always less than 1ms. But after queueing an empty kworker in step 6),
> the callback latency decreases from ~150ms to ~10ms, and I suspect
> that is because the RCU grace period becomes much shorter, but I don't
> know how to debug that (e.g., how to find out why the RCU grace period
> is so long), so I hope to get some help.
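
By the way, the way I read out the ~150ms vs ~10ms numbers above is
conceptually like the sketch below. This is only an illustration of the
measurement idea, not the actual benchmark code; struct reuse_probe and
the function names are made up for this mail.

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Illustrative only: record a timestamp right before call_rcu() and
 * compute the delta when the RCU callback finally runs.
 */
struct reuse_probe {
	struct rcu_head rcu;
	ktime_t queued_at;
};

static void reuse_probe_cb(struct rcu_head *head)
{
	struct reuse_probe *p = container_of(head, struct reuse_probe, rcu);

	/* ~150ms without the empty work, ~10ms with it */
	pr_info("call_rcu -> callback latency: %lld ms\n",
		ktime_ms_delta(ktime_get(), p->queued_at));
	kfree(p);
}

static void reuse_probe_fire(void)
{
	struct reuse_probe *p = kmalloc(sizeof(*p), GFP_ATOMIC);

	if (!p)
		return;
	p->queued_at = ktime_get();
	call_rcu(&p->rcu, reuse_probe_cb);
}
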
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks + queue_empty_work):
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | overwrite           | 13.85      | 17.89                | 21.49             |
> | batch_add_batch_del | 10.22      | 16.65                | 19.07             |
> | add_del_on_diff_cpu | 3.82       | 21.36                | 33.05             |
>
>
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
> +{
> +	udelay(100);
> +}
> +
>  /* When size != 0 bpf_mem_cache for each cpu.
>   * This is typical bpf hash map use case when all elements have equal size.
>   *
> @@ -547,6 +559,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>  		c->cpu = cpu;
>  		c->objcg = objcg;
>  		c->percpu_size = percpu_size;
> +		INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>  		raw_spin_lock_init(&c->lock);
>  		c->reuse.percpu = percpu;
>  		c->reuse.cpu = cpu;
> @@ -574,6 +587,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>  			c->unit_size = sizes[i];
>  			c->cpu = cpu;
>  			c->objcg = objcg;
> +			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>  			raw_spin_lock_init(&c->lock);
>  			c->reuse.percpu = percpu;
>  			c->reuse.lock = &c->lock;
> @@ -793,6 +807,8 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
>  		c->prepare_reuse_tail = llnode;
>  		__llist_add(llnode, &c->prepare_reuse_head);
>  		cnt = ++c->prepare_reuse_cnt;
> +		if (cnt > c->high_watermark && !work_pending(&c->reuse_work))
> +			queue_work(bpf_ma_wq, &c->reuse_work);
>  	} else {
>  		/* unit_free() cannot fail. Therefore add an object to atomic
>  		 * llist. reuse_bulk() will drain it. Though free_llist_extra is
> @@ -901,3 +917,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
>  
>  	return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init bpf_ma_init(void)
> +{
> +	bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
> +	BUG_ON(!bpf_ma_wq);
> +	return 0;
> +}
> +late_initcall(bpf_ma_init);
>
>
>> Could you point me to the code in RCU where it's doing callback batching?
> .
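
For completeness, here is a rough illustration of what step 8) above
boils down to, since reuse_rcu() is the callback whose latency the hack
changes. This is only a simplified sketch for discussion, not the actual
patch; the struct layout, field names and locking below are made up.

#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Illustrative stand-in for the real per-cpu cache. */
struct demo_mem_cache {
	raw_spinlock_t lock;
	struct llist_head reuse_ready;	/* elements whose RCU GP has elapsed */
};

/* One batch of freed elements collected by the irq-work (step 7). */
struct demo_reuse_batch {
	struct rcu_head rcu;
	struct llist_node *head;
	struct llist_node *tail;
	struct demo_mem_cache *cache;
};

/* RCU callback (step 8): a grace period has passed, so splice the freed
 * elements onto the per-cpu reuse list for the allocation path (step 11).
 */
static void demo_reuse_rcu(struct rcu_head *rcu)
{
	struct demo_reuse_batch *batch = container_of(rcu, struct demo_reuse_batch, rcu);
	struct demo_mem_cache *c = batch->cache;
	unsigned long flags;

	raw_spin_lock_irqsave(&c->lock, flags);
	__llist_add_batch(batch->head, batch->tail, &c->reuse_ready);
	raw_spin_unlock_irqrestore(&c->lock, flags);
	kfree(batch);
}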