Hi Alexei,

On 4/27/2023 12:24 PM, Alexei Starovoitov wrote:
> On Sun, Apr 23, 2023 at 03:41:05PM +0800, Hou Tao wrote:
>>>> (3) reuse-after-rcu-gp bpf memory allocator
>>> that's the one you're implementing below, right?
>> Right.
>>>> | name | loop (k/s) | average memory (MiB) | peak memory (MiB) |
>>>> | -- | -- | -- | -- |
>>>> | no_op | 1276 | 0.96 | 1.00 |
>>>> | overwrite | 15.66 | 25.00 | 33.07 |
>>>> | batch_add_batch_del | 10.32 | 18.84 | 22.64 |
>>>> | add_del_on_diff_cpu | 13.00 | 550.50 | 748.74 |
>>>>
>>>> (4) free-after-rcu-gp bpf memory allocator (free directly through call_rcu)
>>> What do you mean? htab uses bpf_ma, but does call_rcu before doing bpf_mem_free ?
>> No, there is no call_rcu() before bpf_mem_free(). bpf_mem_free() in
>> free-after-rcu-gp flavor will do call_rcu() in batch to free these elements back
>> to slab subsystem directly. The elements in this flavor of bpf_ma are not safe
>> for access from sleepable program unless bpf_rcu_read_{lock,unlock}() are used.
>>
>> But I think using call_rcu() to call bpf_mem_free() is good candidate for
>> comparison and I saw bpf_cpumask does that, so I modify bpf hash table to do the
>> similar thing and paste the benchmark result. As we can see from the result,
>> the memory usage for such flavor is much bigger than reuse-after-rcu-gp and
>> free-after-rcu-gp:
> I don't follow what exactly you're doing and what you're measuring.
> Please provide patches for both reuse-after-rcu-gp and free-after-rcu-gp to
> have meaningful conversation.

OK. Will add a new flavor of FREE_AFTER_RCU_GP bpf memory allocator in v3.

> Right now we're stuck at what bench tool is actually measuring.
>
>>>> +    if (try_queue_work && !work_pending(&c->reuse_work)) {
>>>> +        /* Use reuse_cb_in_progress to indicate there is
>>>> +         * inflight reuse kworker or reuse RCU callback.
>>>> +         */
>>>> +        atomic_inc(&c->reuse_cb_in_progress);
>>>> +        /* Already queued */
>>>> +        if (!queue_work(bpf_ma_wq, &c->reuse_work))
>>> how many kthreads are spawned by wq in the peak?
>> I think it depends on the number of bpf_ma. Because bpf_ma_wq is per-CPU
>> workqueue, so for each bpf_ma, there is at most one worker for each CPU. And now
>> the limit for the number of active workers on each CPU is 256, but it is
>> customizable through alloc_workqueue() API.
> Which means that on 8 cpu system there will be 8 * 256 kthreads ?
> That's a lot. Please provide num_of_all_threads before/after/at_peak during bench.

Yes, 8 * 256 is a lot, but there are at most 8 kworkers during the benchmark,
because only one bpf_memory_allocator is used.

>
> Pls trim your replies. Mailers like mutt have a hard time navigating.

Do you mean the email content didn't wrap automatically? Or that the wrap
length is too long (my current setting is 80)?
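
For reference, the worker-count reasoning above can be written down as plain
arithmetic. This is only a userspace sketch: max_reuse_workers() and its
parameter names are hypothetical, not kernel APIs. It assumes each bpf_ma owns
exactly one reuse_work per CPU and that the per-CPU limit of active work items
is 256, as stated above.

```c
/*
 * Hypothetical helper (not a kernel API): upper bound on the number of
 * reuse work items that can be active at the same time on a per-CPU
 * workqueue.
 *
 * Assumptions taken from the discussion above:
 *   - each bpf_ma has at most one reuse_work per CPU
 *   - the workqueue allows at most max_active active work items per CPU
 */
static unsigned int max_reuse_workers(unsigned int nr_cpus,
                                      unsigned int max_active,
                                      unsigned int nr_bpf_ma)
{
    /* Per CPU, the bound is whichever limit is hit first. */
    unsigned int per_cpu = nr_bpf_ma < max_active ? nr_bpf_ma : max_active;

    return nr_cpus * per_cpu;
}
```

With 8 CPUs and the single allocator used by the benchmark,
max_reuse_workers(8, 256, 1) gives 8. The 8 * 256 = 2048 figure is only
reached when 256 or more allocators queue work on every CPU at once.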