Hi,

On 5/4/2023 10:00 AM, Alexei Starovoitov wrote:
> On Thu, May 04, 2023 at 09:35:17AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 5/4/2023 2:48 AM, Alexei Starovoitov wrote:
>>> On Sat, Apr 29, 2023 at 06:12:12PM +0800, Hou Tao wrote:

SNIP

>>> + /* Already queued */
>>> + if (!queue_work(bpf_ma_wq, &c->reuse_work))
>>>
>>> As Martin pointed out queue_work() is not safe here.
>>> The raw_spin_lock(&c->reuse_lock); earlier is not safe either.
>> I see. Didn't recognize these problems.
>>> For the next version please drop workers and spin_lock from unit_free/alloc paths.
>>> If lock has to be taken it should be done from irq_work.
>>> Under no circumstances we can use alloc_workqueue(). No new kthreads.
>> Is there any reason to prohibit the use of new kthread in irq_work ?
> Because:
> 1. there is a workable solution without kthreads.
> 2. if there was no solution we would have to come up with one.
> kthread is not an answer. It's hard to reason about a setup when kthreads
> are in critical path due to scheduler. Assume the system is 100% cpu loaded.
> kthreads delays and behavior is unpredictable. We cannot subject memory alloc/free to it.

I see. Thanks for the explanation.

>
>>> We can avoid adding new flag to bpf_mem_alloc to reduce the complexity
>>> and do roughly equivalent of REUSE_AFTER_RCU_GP unconditionally in the following way:
>>>
>>> - alloc_bulk() won't be trying to steal from c->free_by_rcu.
>>>
>>> - do_call_rcu() does call_rcu(&c->rcu, __free_rcu) instead of task-trace version.
>> No sure whether or not one inflight RCU callback is enough. Will check.
>> If one is not enough, I may use kmalloc(__GFP_NOWAIT) in irq work to
>> allocate multiple RCU callbacks.
> Pls dont. Just assume it will work, implement the proposal (if you agree),
> come back with the numbers and then we will discuss again.
> We cannot keep arguing about merits of complicated patch set that was done on partial data.

OK. Will do.

> Just like the whole thing with kthreads.
> I requested early on: "pls no kthreads" and weeks later we're still arguing.

Sorry about missing that part.

>
>>> - rcu_trace_implies_rcu_gp() is never used.
>>>
>>> - after RCU_GP __free_rcu() moves all waiting_for_gp elements into
>>> a size specific link list per bpf_mem_alloc (not per bpf_mem_cache which is per-cpu)
>>> and does call_rcu_tasks_trace
>>>
>>> - Let's call this list ma->free_by_rcu_tasks_trace
>>> (only one list for bpf_mem_alloc with known size or NUM_CACHES such lists when size == 0 at init)
>>>
>>> - any cpu alloc_bulk() can steal from size specific ma->free_by_rcu_tasks_trace list that
>>> is protected by ma->spin_lock (1 or NUM_CACHES such locks)
>> To reduce the lock contention, alloc_bulk() can steal from the global
>> list in batch.
> Pls no special batches. The simplest implementation possible.
> alloc_bulk() has 'int cnt' argument. It will try to steal 'cnt' from ma->free_by_rcu_tasks_trace.

I see. Will do.

>
>> Had tried the global list before but I didn't do the
>> concurrent freeing, I think it could reduce the risk of OOM for
>> add_del_on_diff_cpu.
> Maybe you've tried, but we didn't see the patches and we cannot take for granted
> anyone saying: "I've tried *foo*. It didn't work. That's why I'm doing *bar* here".
> Everything mm is tricky. Little details matter a lot.

OK. I think it will work. The reason I didn't post it is that I was obsessed
with lock-less bpf ma at that moment.
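To make sure I get the stealing part right, alloc_bulk() would first try to
grab up to 'cnt' elements from the per-ma list before falling back to the
allocator, roughly like the sketch below (ma->spin_lock,
ma->free_by_rcu_tasks_trace and the helper name are all tentative and don't
exist in the current memalloc.c):

/* Sketch only: the ma-level members and the helper name are tentative. */
static struct llist_node *steal_from_free_by_rcu_tasks_trace(struct bpf_mem_alloc *ma,
                                                              int cnt)
{
        struct llist_node *head = NULL, *obj;
        unsigned long flags;

        if (llist_empty(&ma->free_by_rcu_tasks_trace))
                return NULL;

        raw_spin_lock_irqsave(&ma->spin_lock, flags);
        while (cnt--) {
                obj = __llist_del_first(&ma->free_by_rcu_tasks_trace);
                if (!obj)
                        break;
                /* build a private chain; no other CPU can see these nodes now */
                obj->next = head;
                head = obj;
        }
        raw_spin_unlock_irqrestore(&ma->spin_lock, flags);
        return head;
}

alloc_bulk() would then add the stolen elements to its per-cpu free list as it
does today and only call the allocator for whatever is still missing from 'cnt'.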
> It's also questionable whether we should make any design decisions based on this benchmark
> and in particular based on add_del_on_diff_cpu part of it.
> I'm not saying we shouldn't consider it, but all numbers have a "decision weight"
> associated with them.

I see. The reason for adding add_del_on_diff_cpu is just to cover more of the
possible use cases of the bpf memory allocator.

> For example: there is existing samples/bpf/map_perf_test benchmark.
> So far we haven't seen the numbers from it.
> Is it more important than your new bench? Yes and no. All numbers matter.

Will post the benchmark results for map_perf_test in v4. I had planned to
migrate map_perf_test to selftests/bpf/benchs, but couldn't find enough time
to do that.

>
>>> - ma->waiting_for_gp_tasks_trace will be freeing elements into slab
>>>
>>> What it means that sleepable progs using hashmap will be able to avoid uaf with bpf_rcu_read_lock().
>>> Without explicit bpf_rcu_read_lock() it's still safe and equivalent to existing behavior of bpf_mem_alloc.
>>> (while your proposed BPF_MA_FREE_AFTER_RCU_GP flavor is not safe to use in hashtab with sleepable progs)
>>>
>>> After that we can unconditionally remove rcu_head/call_rcu from bpf_cpumask and improve usability of bpf_obj_drop.
>>> Probably usage of bpf_mem_alloc in local storage can be simplified as well.
>>> Martin wdyt?
>>>
>>> I think this approach adds minimal complexity to bpf_mem_alloc while solving all existing pain points
>>> including needs of qp-trie.
>> Thanks for these great suggestions. Will try to do it in v4.
> Thanks.
> Also for benchmark, pls don't hack htab and benchmark as 'non-landable patches' (as in this series).
> Construct the patch series as:
> - prep patches
> - benchmark
> - unconditional convert of bpf_ma to REUSE_AFTER_rcu_GP_and_free_after_rcu_tasks_trace
> with numbers from bench(s) before and after this patch.

Thanks again for the suggestion. Will do in v4.
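Also, to confirm the overall direction before posting v4, below is a rough
sketch of the reuse flow as I understand it. All of the ma-level members
(spin_lock, free_by_rcu_tasks_trace, waiting_for_gp_tasks_trace,
rcu_tasks_trace, tasks_trace_in_progress) and the c->ma back pointer are
tentative and don't exist in the current memalloc.c:

/* Sketch only: the ma-level members and the c->ma back pointer are tentative. */
static void __free_rcu_tasks_trace(struct rcu_head *head)
{
        struct bpf_mem_alloc *ma = container_of(head, struct bpf_mem_alloc, rcu_tasks_trace);

        /* Both a RCU GP and a RCU tasks trace GP have passed: back to slab.
         * The percpu flavour is ignored in this sketch.
         */
        free_all(llist_del_all(&ma->waiting_for_gp_tasks_trace), false);
        atomic_set(&ma->tasks_trace_in_progress, 0);
}

static void __free_rcu(struct rcu_head *head)
{
        struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
        struct bpf_mem_alloc *ma = c->ma;
        struct llist_node *llnode, *pos, *t;
        unsigned long flags;

        /* One RCU GP has passed: the elements become stealable by any CPU */
        llnode = llist_del_all(&c->waiting_for_gp);
        raw_spin_lock_irqsave(&ma->spin_lock, flags);
        llist_for_each_safe(pos, t, llnode)
                __llist_add(pos, &ma->free_by_rcu_tasks_trace);
        if (!atomic_xchg(&ma->tasks_trace_in_progress, 1)) {
                /* Only one tasks-trace callback in flight per bpf_mem_alloc:
                 * move the currently accumulated elements towards slab.
                 */
                llist_for_each_safe(pos, t, __llist_del_all(&ma->free_by_rcu_tasks_trace))
                        __llist_add(pos, &ma->waiting_for_gp_tasks_trace);
                call_rcu_tasks_trace(&ma->rcu_tasks_trace, __free_rcu_tasks_trace);
        }
        raw_spin_unlock_irqrestore(&ma->spin_lock, flags);
        atomic_set(&c->call_rcu_in_progress, 0);
}

So elements are stealable by any CPU between the two grace periods and only go
back to slab after both the RCU GP and the RCU tasks trace GP have passed.
Please let me know if this is not what you had in mind.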