[LSF/MM/BPF TOPIC] Make bpf memory allocator more robust

Hou Tao <houtao@xxxxxxxxxxxxxxx> · Sat, 25 Feb 2023 14:23:29 +0800

bpf memory allocator was introduced in v6.1 by Alexei Starovoitov [0]. Its main
purpose is to provide an any-context allocator for bpf program which can be
attached to anywhere (e.g., __kmalloc()) and any context (e.g., the NMI context
through perf_event_overflow()). Before that, only pre-allocated hash map is
usable for these contexts but it incurs memory waste because typically hash
table is sparse. In addition to the memory saving, it also increases the
performance of dynamically allocated hash map significantly and makes the
hash-map usable for sleep-able program. As more use cases of bpf memory
allocator emerges, it seems that there are some problems that need to be
discussed and fixed.

The first problem is the immediate reuse of elements in bpf memory allocator. 
The reason for immediate reuse is to prevent OOM for typical usage scenario for
bpf hash map, but it introduces use-after-free problem [1] for
dynamically-allocated hash table although the problems existed for pre-allocated
hash-table since it was introduced. For hash-table, the reuse problem may be
acceptable for hash table, but the reuse makes introducing new use cases more
difficult.

For example, in bpf-qp-trie [2] the internal nodes of qp-trie are managed by bpf
memory allocator, if internal node used during lookup is freed and reused, the
lookup procedure may panic or return an incorrect result. Although I have
already implemented a qp-trie demo in which two version numbers are added for
each internal node to ensure its validity: one version is saved in its parent
node and another in itself, but I am not sure about its correctness. bpf_cpumask
was introduced recently [3] is another example, in bpf_cpumask_kptr_get() it
checks the validity of bpf_cpumask by checking whether or not its usage is zero,
but I don't know how does it handle the reuse of bpf_cpumask if the cpumask is
freed and then reused by another bpf_cpumask_create() call.

Alexei proposed using BPF_MA_REUSE_AFTER_GP [4] to solve the reuse problem. For
bpf ma with BPF_MA_REUSE_AFTER_GP, the freed objects are reused only after one
RCU grace period and are returned back to slab system after one-RCU-grace-period
and one-RCU-tasks-trace grace period. So for bpf programs which care about reuse
problem, these programs can use bpf_rcu_read_{lock,unlock}() to access these
freed objects safely and for  those which doesn't care, there will be safely
use-after-free because these objects have not been returned to slab subsystem. I
was worried about the possibility of OOM for BPF_MA_REUSE_AFTER_GP, so I
proposed using BPF_MA_FREE_AFTER_GP [5] to directly return these freed objects
to slab system after one RCU grace period and enforce the accesses of these
objects are protected by bpf_rcu_read_{lock,unlock}(). But if
BPF_MA_FREE_AFTER_GP is used by local storage, it may break existing sleep-able
program. Currently, I am working on BPF_MA_REUSE_AFTER_GP  with Martin. Hope to
work out an RFC soon.

Another problem is the potential OOM problem. bpf memory allocator is more
suitable for the following case: alloc, free, alloc, free on the same CPU. The
above use case is also the typical use case for hash table, but for other use
cases, bpf memory allocator doesn't handle the scenario well and may incur OOM.
One such use case is batched allocation and batched freeing on same CPU.
According to [6], for a small hash table, the peak memory for this use case can
increase to 860MB or more. Another use case is allocation and free are done on
different CPUs [6]. Memory usage exposure can easily occur for such case,
because there is no reuse and these freed objects can only be returned to
subsystem after one RCU tasks-trace grace period.

I think the potential OOM problem can be attacked by two ways. One is returning
these freed objects to slab system timely. Although some work (e.g., skip
unnecessary call_rcu for freeing [7]) has been done, but I think it is not
enough. For example, for bpf_global_ma, because it will not be destroyed like
bpf ma in hash-tab, so there may still be freed objects in per-cpu free_by_rcu
list and will not be freed if there is no free operations on this CPU
afterwards. Also there is no ways to limit the memory usage of bpf_global_ma
because its usage is accounted under root memcg, so maybe a shrinker is also
needed to free some memory back to slab system. Another example is CPU hot-plug.
Because bpf memory allocator is a per-CPU allocator, so when one CPU is offline,
all freed elements need be returned to slab system and when the CPU is online,
we may need to do pre-fill for it. Another approach is to try to reuse freed
object if possible. One fix [6] had already been done to fix the batched
allocation and freed case, but for the case of allocation and freeing on
different CPUs, it seems we may need to share freed object among multiple CPUs 
and do it cheaply.

Not sure whether or not the issues above are important enough for a session, but
I think a discussion in mail-list will be helpful as well.

0: https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@xxxxxxxxx/
1: https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@xxxxxxxxxxxxxxx/
2: https://lore.kernel.org/bpf/20220924133620.4147153-1-houtao@xxxxxxxxxxxxxxx/
3: https://lore.kernel.org/bpf/20230125143816.721952-1-void@xxxxxxxxxxxxx/
4:
https://lore.kernel.org/bpf/CAADnVQKecUqGF-gLFS5Wiz7_E-cHOkp7NPCUK0woHUmJG6hEuA@xxxxxxxxxxxxxx/
5: https://lore.kernel.org/bpf/2a58c4a8-781f-6d84-e72a-f8b7117762b4@xxxxxxxxxxxxxxx/
6: https://lore.kernel.org/bpf/20221209010947.3130477-1-houtao@xxxxxxxxxxxxxxx/
7: https://lore.kernel.org/bpf/20221014113946.965131-3-houtao@xxxxxxxxxxxxxxx/