The bpf memory allocator was introduced in v6.1 by Alexei Starovoitov [0]. Its main purpose is to provide an any-context allocator for bpf programs, which can be attached almost anywhere (e.g., __kmalloc()) and run in any context (e.g., NMI context through perf_event_overflow()). Before it, only the pre-allocated hash map was usable in such contexts, but pre-allocation wastes memory because a hash table is typically sparse. Besides the memory saving, the allocator also significantly improves the performance of the dynamically allocated hash map and makes it usable by sleepable programs. As more use cases of the bpf memory allocator emerge, some problems need to be discussed and fixed.

The first problem is the immediate reuse of freed elements. Immediate reuse is there to prevent OOM in the typical usage scenario of the bpf hash map, but it introduces a use-after-free problem [1] for the dynamically allocated hash table (the same problem has existed for the pre-allocated hash table since it was introduced). For the hash table the reuse may be acceptable, but it makes introducing new use cases more difficult. For example, in bpf-qp-trie [2] the internal nodes of the qp-trie are managed by the bpf memory allocator; if an internal node used during a lookup is freed and reused, the lookup may panic or return an incorrect result. I have implemented a qp-trie demo in which each internal node carries two version numbers to check its validity (one saved in its parent node and one in the node itself), but I am not sure about its correctness. The recently introduced bpf_cpumask [3] is another example: bpf_cpumask_kptr_get() checks the validity of a bpf_cpumask by checking whether its usage count is zero, but I don't know how it handles the case where the cpumask is freed and then reused by another bpf_cpumask_create() call.

Alexei proposed BPF_MA_REUSE_AFTER_GP [4] to solve the reuse problem. With BPF_MA_REUSE_AFTER_GP, freed objects are reused only after one RCU grace period and are returned to the slab subsystem only after one RCU grace period plus one RCU tasks trace grace period. Programs which care about the reuse problem can then use bpf_rcu_read_{lock,unlock}() to access the freed objects safely, and for programs which don't care, a use-after-free remains "safe" because the objects have not yet been returned to the slab subsystem. Because I was worried about the OOM possibility of BPF_MA_REUSE_AFTER_GP, I proposed BPF_MA_FREE_AFTER_GP [5], which returns freed objects directly to the slab subsystem after one RCU grace period and enforces that all accesses to these objects are protected by bpf_rcu_read_{lock,unlock}(). However, if BPF_MA_FREE_AFTER_GP is used by local storage, it may break existing sleepable programs. Currently I am working on BPF_MA_REUSE_AFTER_GP with Martin and hope to post an RFC soon. Two sketches follow: one of the in-kernel allocator interface in question, and one of the intended usage model under the proposed flags.
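For reference, here is a minimal sketch of how a map-like user consumes the in-kernel allocator API (bpf_mem_alloc_init(), bpf_mem_cache_alloc() and bpf_mem_cache_free() from include/linux/bpf_mem_alloc.h); the struct and function names of the example user are made up:

  #include <linux/bpf_mem_alloc.h>

  struct my_node {
          struct my_node *child[2];
          u64 key;
  };

  /* one allocator per map, caching a single object size */
  static struct bpf_mem_alloc node_ma;

  static int my_map_init(void)
  {
          /* non-percpu, fixed-size cache usable from any context */
          return bpf_mem_alloc_init(&node_ma, sizeof(struct my_node), false);
  }

  static struct my_node *my_node_alloc(void)
  {
          /* safe even in NMI/IRQ context; may return NULL if the per-CPU
           * free list is empty and the refill has not caught up yet
           */
          return bpf_mem_cache_alloc(&node_ma);
  }

  static void my_node_free(struct my_node *node)
  {
          /* goes back to the per-CPU free list and may be handed out again
           * by the very next allocation on this CPU, which is exactly the
           * immediate-reuse behaviour discussed above
           */
          bpf_mem_cache_free(&node_ma, node);
  }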
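And here is a rough sketch, purely illustrative since neither flag is merged yet, of the usage model the proposals aim for on the bpf program side: a program that cares about reuse brackets its accesses with the bpf_rcu_read_lock()/bpf_rcu_read_unlock() kfuncs, so an element deleted in parallel cannot be returned to the slab subsystem (and, with BPF_MA_REUSE_AFTER_GP, cannot even be reused) before the critical section ends. The map layout and attach point below are made up:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  void bpf_rcu_read_lock(void) __ksym;
  void bpf_rcu_read_unlock(void) __ksym;

  struct val {
          __u64 cnt;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, __u32);
          __type(value, struct val);
  } htab SEC(".maps");

  SEC("lsm.s/file_open")
  int BPF_PROG(file_open, struct file *file)
  {
          __u32 key = 0;
          struct val *v;

          bpf_rcu_read_lock();
          v = bpf_map_lookup_elem(&htab, &key);
          if (v)
                  /* the element may be deleted by another program or by
                   * user space in parallel, but it cannot go back to slab
                   * before the RCU grace period following this section
                   */
                  __sync_fetch_and_add(&v->cnt, 1);
          bpf_rcu_read_unlock();
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";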
Another problem is potential OOM. The bpf memory allocator is best suited to an alloc, free, alloc, free pattern on the same CPU, which is also the typical access pattern of the hash table; other patterns are not handled well and may incur OOM. One such pattern is batched allocation followed by batched freeing on the same CPU: according to [6], the peak memory usage of a small hash table can grow to 860MB or more in this case. Another is allocation and freeing on different CPUs [6]. Memory usage can easily blow up there, because there is no reuse and the freed objects can only be returned to the slab subsystem after one RCU tasks trace grace period.

I think the potential OOM problem can be attacked in two ways. One is returning freed objects to the slab subsystem in a timely manner. Some work has already been done (e.g., skipping unnecessary call_rcu() during freeing [7]), but I think it is not enough. For example, bpf_global_ma is never destroyed the way the hash table's bpf ma is, so freed objects may linger in the per-CPU free_by_rcu list and will never be freed if no further free operations happen on that CPU. There is also no way to limit the memory usage of bpf_global_ma because it is accounted under the root memcg, so a shrinker may be needed to release some memory back to the slab subsystem. Another example is CPU hotplug: because the bpf memory allocator is a per-CPU allocator, when a CPU goes offline all of its freed elements should be returned to the slab subsystem, and when it comes back online we may need to pre-fill its caches again. The other way is to reuse freed objects whenever possible. A fix [6] has already been made for the batched allocation and freeing case, but for allocation and freeing on different CPUs it seems we need a cheap way to share freed objects among multiple CPUs.

I am not sure whether the issues above are important enough for a session, but a discussion on the mailing list would be helpful as well.

0: https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@xxxxxxxxx/
1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/
2: https://lore.kernel.org/bpf/20220924133620.4147153-1-houtao@xxxxxxxxxxxxxxx/
3: https://lore.kernel.org/bpf/20230125143816.721952-1-void@xxxxxxxxxxxxx/
4: https://lore.kernel.org/bpf/CAADnVQKecUqGF-gLFS5Wiz7_E-cHOkp7NPCUK0woHUmJG6hEuA@xxxxxxxxxxxxxx/
5: https://lore.kernel.org/bpf/2a58c4a8-781f-6d84-e72a-f8b7117762b4@xxxxxxxxxxxxxxx/
6: https://lore.kernel.org/bpf/20221209010947.3130477-1-houtao@xxxxxxxxxxxxxxx/
7: https://lore.kernel.org/bpf/20221014113946.965131-3-houtao@xxxxxxxxxxxxxxx/