Hi,

On 6/8/2023 11:35 AM, Hou Tao wrote:
> Hi,
>
> On 6/8/2023 8:34 AM, Alexei Starovoitov wrote:
>> On Wed, Jun 7, 2023 at 5:13 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>> On Wed, Jun 07, 2023 at 04:50:35PM -0700, Alexei Starovoitov wrote:
>>>> On Wed, Jun 7, 2023 at 4:30 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> SNIP
>
> By comparing the implementations of v3 and v4, I found one hack which
> could reduce the memory usage of v4 (with per-cpu reusable list)
> significantly, making the memory usage similar between v3 and v4. If we
> queue an empty work before calling irq_work_raise() as shown below, the
> invocation latency of reuse_rcu() (a normal RCU callback) decreases from
> ~150ms to ~10ms. I think the reason is that the duration of the normal
> RCU grace period is reduced a lot, but I don't know why that happens.
> Hope to get some help from Paul. Because Paul doesn't have enough
> context, I will try to explain the context of the weird problem below.
> And Alexei, could you please also try the hack below for your
> multiple-rcu-cbs version?

An update on the queue_work() hack: it works for both CONFIG_PREEMPT=y
and CONFIG_PREEMPT=n. I will try to enable RCU tracing to check whether
or not there is any difference.

>
> Hi Paul,
>
> I just found out that the time between the call of call_rcu(..,
> reuse_rcu) and the invocation of the RCU callback (namely reuse_rcu())
> decreases a lot (from ~150ms to ~10ms) if I queue an empty kworker
> periodically as shown in the diff below. Before the diff below is
> applied, the benchmark process does the following things on a VM with
> 8 CPUs:
>
> 1) create a pthread htab_mem_producer on each CPU and pin the thread
>    to that specific CPU
> 2) htab_mem_producer calls syscall(__NR_getpgid) repeatedly in a
>    dead loop
> 3) the call of getpgid() triggers the invocation of a bpf program
>    attached to the getpgid() syscall
> 4) the bpf program overwrites 2048 elements in a bpf hash map
> 5) during the overwrite, it frees the existing element first
> 6) the free calls unit_free(), and unit_free() raises an irq-work in
>    batches once 96 elements have been freed
> 7) in the irq-work, it allocates a new struct to hold the freed
>    elements and the rcu_head, and does call_rcu(..., reuse_rcu)
> 8) in reuse_rcu() it just moves these freed elements onto a per-cpu
>    reuse list
> 9) after the free completes, the overwrite allocates a new element
> 10) the allocation may also raise an irq-work in batches once the
>     preallocated elements are exhausted
> 11) in the irq-work, it tries to fetch elements from the per-cpu reuse
>     list, and if the list is empty, it falls back to kmalloc()
>
> For the procedure described above, the latency between the call of
> call_rcu() and the invocation of reuse_rcu() is about ~150ms or larger.
> I have also checked the latency of syscall(__NR_getpgid) and it is
> always less than 1ms. But after queueing an empty kworker in step 6),
> the callback latency decreases from ~150ms to ~10ms, and I suspect
> that is because the RCU grace period becomes much shorter, but I don't
> know how to debug that (e.g., how to find out why the RCU grace period
> is so long), so I hope to get some help.
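
By the way, the way I read out the ~150ms vs ~10ms numbers above is
conceptually like the sketch below. This is only an illustration of the
measurement idea, not the actual benchmark code; struct reuse_probe and
the function names are made up for this mail.

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Illustrative only: record a timestamp right before call_rcu() and
 * compute the delta when the RCU callback finally runs.
 */
struct reuse_probe {
	struct rcu_head rcu;
	ktime_t queued_at;
};

static void reuse_probe_cb(struct rcu_head *head)
{
	struct reuse_probe *p = container_of(head, struct reuse_probe, rcu);

	/* ~150ms without the empty work, ~10ms with it */
	pr_info("call_rcu -> callback latency: %lld ms\n",
		ktime_ms_delta(ktime_get(), p->queued_at));
	kfree(p);
}

static void reuse_probe_fire(void)
{
	struct reuse_probe *p = kmalloc(sizeof(*p), GFP_ATOMIC);

	if (!p)
		return;
	p->queued_at = ktime_get();
	call_rcu(&p->rcu, reuse_probe_cb);
}
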
>
> htab-mem-benchmark (reuse-after-RCU-GP + per-cpu reusable list + multiple reuse_rcu() callbacks + queue_empty_work):
> | name                | loop (k/s) | average memory (MiB) | peak memory (MiB) |
> | --                  | --         | --                   | --                |
> | overwrite           | 13.85      | 17.89                | 21.49             |
> | batch_add_batch_del | 10.22      | 16.65                | 19.07             |
> | add_del_on_diff_cpu | 3.82       | 21.36                | 33.05             |
>
>
> +static void bpf_ma_prepare_reuse_work(struct work_struct *work)
> +{
> +	udelay(100);
> +}
> +
>  /* When size != 0 bpf_mem_cache for each cpu.
>   * This is typical bpf hash map use case when all elements have equal size.
>   *
> @@ -547,6 +559,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>  		c->cpu = cpu;
>  		c->objcg = objcg;
>  		c->percpu_size = percpu_size;
> +		INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>  		raw_spin_lock_init(&c->lock);
>  		c->reuse.percpu = percpu;
>  		c->reuse.cpu = cpu;
> @@ -574,6 +587,7 @@ int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu)
>  			c->unit_size = sizes[i];
>  			c->cpu = cpu;
>  			c->objcg = objcg;
> +			INIT_WORK(&c->reuse_work, bpf_ma_prepare_reuse_work);
>  			raw_spin_lock_init(&c->lock);
>  			c->reuse.percpu = percpu;
>  			c->reuse.lock = &c->lock;
> @@ -793,6 +807,8 @@ static void notrace unit_free(struct bpf_mem_cache *c, void *ptr)
>  		c->prepare_reuse_tail = llnode;
>  		__llist_add(llnode, &c->prepare_reuse_head);
>  		cnt = ++c->prepare_reuse_cnt;
> +		if (cnt > c->high_watermark && !work_pending(&c->reuse_work))
> +			queue_work(bpf_ma_wq, &c->reuse_work);
>  	} else {
>  		/* unit_free() cannot fail. Therefore add an object to atomic
>  		 * llist. reuse_bulk() will drain it. Though free_llist_extra is
> @@ -901,3 +917,11 @@ void notrace *bpf_mem_cache_alloc_flags(struct bpf_mem_alloc *ma, gfp_t flags)
>  
>  	return !ret ? NULL : ret + LLIST_NODE_SZ;
>  }
> +
> +static int __init bpf_ma_init(void)
> +{
> +	bpf_ma_wq = alloc_workqueue("bpf_ma", WQ_MEM_RECLAIM, 0);
> +	BUG_ON(!bpf_ma_wq);
> +	return 0;
> +}
> +late_initcall(bpf_ma_init);
>
>
>> Could you point me to the code in RCU where it's doing callback batching?
> .
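
For completeness, here is a rough illustration of what step 8) above
boils down to, since reuse_rcu() is the callback whose latency the hack
changes. This is only a simplified sketch for discussion, not the actual
patch; the struct layout, field names and locking below are made up.

#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Illustrative stand-in for the real per-cpu cache. */
struct demo_mem_cache {
	raw_spinlock_t lock;
	struct llist_head reuse_ready;	/* elements whose RCU GP has elapsed */
};

/* One batch of freed elements collected by the irq-work (step 7). */
struct demo_reuse_batch {
	struct rcu_head rcu;
	struct llist_node *head;
	struct llist_node *tail;
	struct demo_mem_cache *cache;
};

/* RCU callback (step 8): a grace period has passed, so splice the freed
 * elements onto the per-cpu reuse list for the allocation path (step 11).
 */
static void demo_reuse_rcu(struct rcu_head *rcu)
{
	struct demo_reuse_batch *batch = container_of(rcu, struct demo_reuse_batch, rcu);
	struct demo_mem_cache *c = batch->cache;
	unsigned long flags;

	raw_spin_lock_irqsave(&c->lock, flags);
	__llist_add_batch(batch->head, batch->tail, &c->reuse_ready);
	raw_spin_unlock_irqrestore(&c->lock, flags);
	kfree(batch);
}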