On Tue, Jan 3, 2023 at 5:40 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 1/1/2023 9:26 AM, Alexei Starovoitov wrote:
> > On Fri, Dec 30, 2022 at 12:11:45PM +0800, Hou Tao wrote:
> >> From: Hou Tao <houtao1@xxxxxxxxxx>
> >>
> >> Hi,
> >>
> >> The patchset tries to fix the problems found when checking how the htab
> >> map handles element reuse in the bpf memory allocator. The immediate
> >> reuse of freed elements may lead to two problems in the htab map:
> >>
> >> (1) reuse will reinitialize special fields (e.g., bpf_spin_lock) in the
> >>     htab map value, and it may corrupt a lookup procedure using the
> >>     BPF_F_LOCK flag, which acquires the bpf-spin-lock during value
> >>     copying. The corruption of the bpf-spin-lock may result in a hard
> >>     lock-up.
> >> (2) a lookup procedure may get an incorrect map value if the found
> >>     element is freed and then reused.
> >>
> >> Because all elements of an htab map have the same type, problem #1 can
> >> be fixed by supporting a ctor in the bpf memory allocator. The ctor
> >> initializes these special fields in a map element only when the element
> >> is newly allocated. If it is just a reused element, there is no
> >> reinitialization.
> > Instead of adding the overhead of a ctor callback let's just
> > add __GFP_ZERO to flags in __alloc().
> > That will address issue 1 and will make bpf_mem_alloc behave just
> > like percpu_freelist, so a hashmap with BPF_F_NO_PREALLOC and the
> > default one will behave the same way.
> Using __GFP_ZERO would be simpler, but the overhead of memset() for every
> allocated object may be bigger than that of a ctor callback when the
> allocated object is large. It also introduces unnecessary memory zeroing
> if there is no special field in the map value.

A small memset is often faster than an indirect call. I doubt that adding
__GFP_ZERO will have a measurable perf difference in map_perf_test and
other benchmarks.

> >> Problem #2 exists for both non-preallocated and preallocated htab maps.
> >> By adding a seq in the htab element, doing a reuse check and retrying
> >> the lookup procedure may be a feasible solution, but it would make the
> >> lookup API hard to use, because the user needs to check whether the
> >> found element has been reused and repeat the lookup procedure if it
> >> has. A simpler solution would be to just disable the reuse of freed
> >> elements and free them after the lookup procedure ends.
> > You've proposed this 'solution' twice already in the qptrie thread and
> > both times the answer was 'no, we cannot do this' with reasons explained.
> > The 3rd time the answer is still the same.
> This time a workable demo which calls call_rcu_tasks_trace() in batch is
> provided :) Also because I can not find a better solution for the reuse
> problem. But you are right: although not reusing freed elements would
> make the map implementation simpler, the potential OOM problem is hard to
> solve, especially when the RCU tasks trace grace period is slow. Hope
> Paul can provide some insight about the problem.

OOM is exactly the reason why we cannot do this delaying logic in the
general case. We don't control what progs do and rcu tasks trace may take
a long time.

> > This 'issue 2' has existed in the hashmap since the very beginning, for
> > many years. It's a known quirk. There is nothing to fix really.
> Do we need to document the unexpected behavior somewhere? Because I
> really didn't know anything about this quirk.

Yeah. It's not documented in Documentation/bpf/map_hash.rst. Please send
a patch to add it.
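To make the BPF_F_LOCK scenario from the cover letter above concrete, here is a
minimal sketch of the kind of map it is talking about: a BPF_F_NO_PREALLOC hash
map whose value embeds a bpf_spin_lock. The struct, map and program names below
are made up for illustration and are not from the patch set.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only -- not part of the patch set. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct pkt_stat {
	struct bpf_spin_lock lock;	/* the "special field" reuse must not clobber */
	long packets;
	long bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* elements come from bpf_mem_alloc */
	__uint(max_entries, 1024);
	__type(key, __u32);
	__type(value, struct pkt_stat);
} stats SEC(".maps");

SEC("tc")
int count_packets(struct __sk_buff *skb)
{
	struct pkt_stat zero = {}, *v;
	__u32 key = skb->ifindex;

	bpf_map_update_elem(&stats, &key, &zero, BPF_NOEXIST);
	v = bpf_map_lookup_elem(&stats, &key);
	if (!v)
		return 0;

	/* The lock protects the value against a concurrent BPF_F_LOCK
	 * lookup from user space, which copies the value while holding
	 * this bpf_spin_lock.
	 */
	bpf_spin_lock(&v->lock);
	v->packets++;
	v->bytes += skb->len;
	bpf_spin_unlock(&v->lock);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space would then read the value with something like
bpf_map_lookup_elem_flags(map_fd, &key, &value, BPF_F_LOCK); that copy-under-lock
path is what a reinitialized (or stale) bpf_spin_lock in a reused element could
corrupt.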
> >
> > The graph apis (aka new gen data structs) with link list and rbtree are
> > in active development. Soon bpf progs will be able to implement their own
> > hash maps with explicit bpf_rcu_read_lock. At that time the progs will
> > be making the trade-off between performance and lookup/delete races.
> It seems these new gen data structs also need to solve the reuse problem,
> because a global bpf memory allocator is used.

Currently the graph api is single owner, and kptr_xchg is used to allow a
single cpu at a time to observe the object. In the future we will add
bpf_refcount and multi owner. Then multiple cpus will be able to access the
same object concurrently. They will race to read/write the fields, and it
will be the prog's decision how to arbitrate the access. In other words,
the bpf prog needs to have its own mechanisms to deal with reuse.
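For readers who have not followed the graph-API work, here is a rough sketch of
the linked-list flavor being referred to. The kfunc declarations mirror the
6.2-era tools/testing/selftests/bpf/bpf_experimental.h and have changed in later
kernels; the struct and map names are invented for illustration.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only; declarations follow the 6.2-era selftests
 * and may differ on newer kernels.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Mirrors the selftests' __contains() annotation for bpf_list_head. */
#define __contains(name, node) \
	__attribute__((btf_decl_tag("contains:" #name ":" #node)))

extern void *bpf_obj_new_impl(__u64 local_type_id, void *meta) __ksym;
#define bpf_obj_new(type) \
	((type *)bpf_obj_new_impl(bpf_core_type_id_local(type), NULL))
extern void bpf_list_push_front(struct bpf_list_head *head,
				struct bpf_list_node *node) __ksym;

struct item {
	long data;
	struct bpf_list_node node;
};

struct list_value {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(item, node);
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, int);
	__type(value, struct list_value);
} list_map SEC(".maps");

SEC("tc")
int add_item(struct __sk_buff *skb)
{
	struct list_value *v;
	struct item *it;
	int key = 0;

	v = bpf_map_lookup_elem(&list_map, &key);
	if (!v)
		return 0;

	/* The allocation comes from the global bpf memory allocator, which
	 * is why the reuse question raised above applies here as well.
	 */
	it = bpf_obj_new(typeof(*it));
	if (!it)
		return 0;
	it->data = skb->len;

	/* Ownership of 'it' is transferred to the list; push/pop must be
	 * done under the bpf_spin_lock associated with the list head.
	 */
	bpf_spin_lock(&v->lock);
	bpf_list_push_front(&v->head, &it->node);
	bpf_spin_unlock(&v->lock);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

In this single-owner model a node is reachable either through the list or
through the reference the program just allocated, never both at once, which is
the arbitration point Alexei describes above.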