On Sun, Apr 23, 2023 at 03:41:05PM +0800, Hou Tao wrote:
> >> (3) reuse-after-rcu-gp bpf memory allocator
> > that's the one you're implementing below, right?
> Right.
>
> >> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> >> | --                  | --         | --                   | --                |
> >> | no_op               | 1276       | 0.96                 | 1.00              |
> >> | overwrite           | 15.66      | 25.00                | 33.07             |
> >> | batch_add_batch_del | 10.32      | 18.84                | 22.64             |
> >> | add_del_on_diff_cpu | 13.00      | 550.50               | 748.74            |
> >>
> >> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
> > What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free?
> No, there is no call_rcu() before bpf_mem_free(). bpf_mem_free() in the
> free-after-rcu-gp flavor will do call_rcu() in batch to free these elements
> back to the slab subsystem directly. The elements in this flavor of bpf_ma
> are not safe for access from sleepable programs unless
> bpf_rcu_read_{lock,unlock}() is used.
>
> But I think using call_rcu() to call bpf_mem_free() is a good candidate for
> comparison, and I saw that bpf_cpumask does that, so I modified the bpf hash
> table to do the same thing and pasted the benchmark result. As we can see
> from the result, the memory usage for that flavor is much bigger than for
> reuse-after-rcu-gp and free-after-rcu-gp:

I don't follow what exactly you're doing and what you're measuring.
Please provide patches for both reuse-after-rcu-gp and free-after-rcu-gp
to have a meaningful conversation.
Right now we're stuck at what the bench tool is actually measuring.

> >> +	if (try_queue_work && !work_pending(&c->reuse_work)) {
> >> +		/* Use reuse_cb_in_progress to indicate there is an
> >> +		 * inflight reuse kworker or reuse RCU callback.
> >> +		 */
> >> +		atomic_inc(&c->reuse_cb_in_progress);
> >> +		/* Already queued */
> >> +		if (!queue_work(bpf_ma_wq, &c->reuse_work))
> > how many kthreads are spawned by wq in the peak?
> I think it depends on the number of bpf_ma. Because bpf_ma_wq is a per-CPU
> workqueue, for each bpf_ma there is at most one worker on each CPU. Currently
> the limit on the number of active workers on each CPU is 256, but it is
> customizable through the alloc_workqueue() API.

Which means that on an 8-CPU system there will be 8 * 256 kthreads?
That's a lot.
Please provide num_of_all_threads before/after/at_peak during bench.

Pls trim your replies. Mailers like mutt have a hard time navigating.
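
For reference, a minimal sketch of the call_rcu()-in-batch pattern Hou Tao
describes above: freed elements are collected and one RCU callback returns
the whole batch to slab after a grace period. This is not the actual patch;
the names (free_batch, free_batch_rcu_cb, free_batch_after_gp) and the
object layout are hypothetical.

	/* Hypothetical sketch of free-after-rcu-gp batching, not the real
	 * bpf_ma code. Assumes each element embeds a struct llist_node at
	 * offset 0, so kfree() on the node pointer frees the element.
	 */
	#include <linux/llist.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct free_batch {
		struct llist_head elems;	/* elements waiting for a GP */
		struct rcu_head rcu;
	};

	static void free_batch_rcu_cb(struct rcu_head *rcu)
	{
		struct free_batch *b = container_of(rcu, struct free_batch, rcu);
		struct llist_node *n, *tmp;

		/* A grace period has elapsed, so no RCU reader can still
		 * see these elements; free them back to slab directly.
		 */
		llist_for_each_safe(n, tmp, llist_del_all(&b->elems))
			kfree(n);
		kfree(b);
	}

	/* One call_rcu() covers the whole batch, not one per element. */
	static void free_batch_after_gp(struct free_batch *b)
	{
		call_rcu(&b->rcu, free_batch_rcu_cb);
	}

The point of the batching is that a single rcu_head and a single callback
invocation cover many elements, instead of one call_rcu() per freed object
as in the bpf_cpumask-style variant benchmarked above.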
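
On the worker-count question: the 256-per-CPU figure matches the workqueue
default max_active (WQ_DFL_ACTIVE). A sketch, with the cap made explicit, of
how bpf_ma_wq might be created; the flag and the value 16 here are
illustrative assumptions, not taken from the patch.

	#include <linux/workqueue.h>

	static struct workqueue_struct *bpf_ma_wq;

	static int __init bpf_ma_wq_init(void)
	{
		/* A per-CPU (not WQ_UNBOUND) workqueue. max_active caps how
		 * many work items may execute concurrently on each CPU;
		 * passing 0 picks the default WQ_DFL_ACTIVE (256), which is
		 * where the 8 * 256 worst case above comes from. A smaller
		 * explicit value bounds the kworker count.
		 */
		bpf_ma_wq = alloc_workqueue("bpf_ma_wq", WQ_MEM_RECLAIM, 16);
		return bpf_ma_wq ? 0 : -ENOMEM;
	}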