On 12/5/22 8:29 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> When there are batched freeing operations on a specific CPU, part of
> the freed elements ((high_watermark - lower_watermark) / 2 + 1) will be
> moved to waiting_for_gp list and the remaining part will be left in
> free_by_rcu list and waits for the expiration of RCU-tasks-trace grace
> period and the next invocation of free_bulk().
The change below LGTM. However, the above description does not seem
precise. IIUC, whether elements move from the free_by_rcu list to the
waiting_for_gp list is controlled by call_rcu_in_progress. If it is
true, the free_by_rcu list remains intact and nothing is moved to
waiting_for_gp. So it is not the case that "the remaining part will be
left in free_by_rcu": either all elements in free_by_rcu are moved to
waiting_for_gp, or none are.
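To make that concrete, here is a paraphrased sketch of the handover
logic as I read it in kernel/bpf/memalloc.c (simplified from memory,
not the literal source; the WARN_ON checks and the helper names such as
do_call_rcu() and __free_rcu_tasks_trace() are from my reading of the
file and may not match it verbatim):

static void do_call_rcu(struct bpf_mem_cache *c)
{
        struct llist_node *llnode, *t;

        /* A previous RCU-tasks-trace grace period is still pending:
         * leave free_by_rcu untouched, move nothing to waiting_for_gp.
         */
        if (atomic_xchg(&c->call_rcu_in_progress, 1))
                return;

        /* Otherwise move every element currently on free_by_rcu. */
        llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
                __llist_add(llnode, &c->waiting_for_gp);

        /* Free the whole batch once the grace period expires. */
        call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
}

So whenever call_rcu_in_progress is already set, the whole batch stays
on free_by_rcu until a later free_bulk() finds the flag cleared.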
> So instead of invoking __alloc_percpu_gfp() or kmalloc_node() to
> allocate a new object, in alloc_bulk() just check whether or not there
> is freed element in free_by_rcu and reuse it if available.
>
> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
LGTM except for the above suggestions.

Acked-by: Yonghong Song <yhs@xxxxxx>
> ---
>  kernel/bpf/memalloc.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> index 8f0d65f2474a..7daf147bc8f6 100644
> --- a/kernel/bpf/memalloc.c
> +++ b/kernel/bpf/memalloc.c
> @@ -171,9 +171,24 @@ static void alloc_bulk(struct bpf_mem_cache *c, int cnt, int node)
>          memcg = get_memcg(c);
>          old_memcg = set_active_memcg(memcg);
>          for (i = 0; i < cnt; i++) {
> -                obj = __alloc(c, node);
> -                if (!obj)
> -                        break;
> +                /*
> +                 * free_by_rcu is only manipulated by irq work refill_work().
> +                 * IRQ works on the same CPU are called sequentially, so it is
> +                 * safe to use __llist_del_first() here. If alloc_bulk() is
> +                 * invoked by the initial prefill, there will be no running
> +                 * irq work, so __llist_del_first() is fine as well.
> +                 *
> +                 * In most cases, objects on free_by_rcu are from the same CPU.
> +                 * If some objects come from other CPUs, it doesn't incur any
> +                 * harm because NUMA_NO_NODE means the preference for current
> +                 * numa node and it is not a guarantee.
> +                 */
> +                obj = __llist_del_first(&c->free_by_rcu);
> +                if (!obj) {
> +                        obj = __alloc(c, node);
> +                        if (!obj)
> +                                break;
> +                }
>                  if (IS_ENABLED(CONFIG_PREEMPT_RT))
>                          /* In RT irq_work runs in per-cpu kthread, so disable
>                           * interrupts to avoid preemption and interrupts and
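As an aside for readers less familiar with the pattern, below is a
small stand-alone user-space model of the reuse-before-allocate idea in
the hunk above. It is only an illustration under the patch's assumption
that each free_by_rcu list has a single serialized consumer; every name
in it (cache_model, pop_first, alloc_one) is made up for the example
and is not the kernel API:

#include <stdio.h>
#include <stdlib.h>

/* Toy single-consumer model: pop a previously freed object from a
 * per-CPU-style free list before falling back to a fresh allocation.
 */
struct node {
        struct node *next;
};

struct cache_model {
        struct node *free_by_rcu;       /* stand-in for c->free_by_rcu */
};

/* Non-atomic pop, analogous in spirit to __llist_del_first(): only valid
 * because the model assumes a single serialized consumer, just like the
 * per-CPU irq work refill_work() in the kernel code.
 */
static struct node *pop_first(struct cache_model *c)
{
        struct node *obj = c->free_by_rcu;

        if (obj)
                c->free_by_rcu = obj->next;
        return obj;
}

static struct node *alloc_one(struct cache_model *c)
{
        struct node *obj = pop_first(c);        /* reuse a freed element ... */

        if (!obj)
                obj = calloc(1, sizeof(*obj));  /* ... or allocate a new one */
        return obj;
}

int main(void)
{
        struct cache_model c = { .free_by_rcu = NULL };
        struct node *obj = alloc_one(&c);       /* empty list: allocates */

        obj->next = c.free_by_rcu;              /* put it back on the free list */
        c.free_by_rcu = obj;
        printf("reused: %d\n", alloc_one(&c) == obj);   /* non-empty: reuses */
        free(obj);
        return 0;
}

(Not something to apply anywhere, just a mental model of why the
fallback to __alloc() only happens when free_by_rcu is empty.)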