Re: [PATCH v3 bpf-next 00/15] bpf: BPF specific memory allocator, UAPI in particular

Delyan Kratunov <delyank@xxxxxx> · Wed, 31 Aug 2022 21:02:53 +0000

On Wed, 2022-08-31 at 11:57 -0700, Alexei Starovoitov wrote:
> !-------------------------------------------------------------------|
>   This Message Is From an External Sender
> 
> > -------------------------------------------------------------------!
> 
> On Wed, Aug 31, 2022 at 05:38:15PM +0000, Delyan Kratunov wrote:
> > 
> > Overall, this design (or maybe the way it's presented here) conflates a few ideas.
> > 
> > 1) The extensions to expose and customize map's internal element allocator are fine
> > independently of even this patchset.
> > 
> > 2) The idea that kptrs in a map need to have a statically identifiable allocator is
> > taken as an axiom, and then expanded to its extreme (single allocator per map as
> > opposed to the smarter verifier schemes). I still contest that this is not the case
> > and the runtime overhead it avoids is paid back in bad developer experience multiple
> > times over.
> > 
> > 3) The idea that allocators can be merged between elements and kptrs is independent
> > of the static requirements. If the map's internal allocator is exposed per 1), we can
> > still use it to allocate kptrs but not require that all kptrs in a map are from the
> > same allocator.
> > 
> > Going this coarse in the API is easy for us but fundamentally more limiting for
> > developers. It's not hard to imagine situations where the verifier dependency
> > tracking or runtime lifetime tracking would allow for pinned maps to be retained but
> > this scheme would require new maps entirely. (Any situation where you just refactored
> > the implicit allocator out to share it, for example)
> > 
> > I also don't think that simplicity for us (a one time implementation cost +
> > continuous maintenance cost) trumps over long term developer experience (a much
> > bigger implementation cost over a much bigger time span).
> 
> It feels we're thinking about scope and use cases for the allocator quite
> differently and what you're seeing as 'limiting developer choices' to me looks
> like 'not a limiting issue at all'. To me the allocator is one
> jemalloc/tcmalloc instance. One user space application with multiple threads,
> lots of maps and code is using exactly one such allocator. The allocator
> manages all the memory of user space process. In bpf land we don't have a bpf
> process. We don't have a bpf name space either.  A loose analogy would be a set
> of programs and maps managed by one user space agent. The bpf allocator would
> manage all the memory of these maps and programs and provide a "memory namespace"
> for this set of programs. Another user space agent with its own programs
> would never want to share the same allocator. In user space a chunk of memory
> could be mmap-ed between different process to share the data, but you would never
> put a tcmalloc over such memory to be an allocator for different processes.
> 
> More below.
> 
> > So far, my ranked choice vote is:
> > 
> > 1) maximum flexibility and runtime live object counts (with exposed allocators, I
> > like the merging)
> > 2) medium flexibility with per-field allocator tracking in the verifier and the
> > ability to lose the association once programs are unloaded and values are gone. This
> > also works better with exposed allocators since they are implicitly pinned and would
> > be usable to store values in another map.
> > 3) minimum flexibility with static whole-map kptr allocators
> 
> The option 1 flexibility is necessary when allocator is seen as a private pool
> of objects of given size. Like kernel's kmem_cache instance.
> I don't think we quite there yet.

If we're not there, we should aim to get there :) 

> There is a need to "preallocate this object from sleepable context,
> so the prog has a guaranteed chunk of memory to use in restricted context",
> but, arguably, it's not a job of bpf allocator. 

Leaving it to the programs is worse for memory usage (discussed below).

> bpf prog can allocate an object, stash it into kptr, and use it later.

Given that tracing programs can't really maintain their own freelists safely (I think
they're missing the building blocks - you can't cmpxchg kptrs), I do feel like
isolated allocators are a requirement here. Without them, allocations can fail and
there's no way to write a reliable program.

*If* we ensure that you can build a usable freelist out of allocator-backed memory
for (a set of) nmi programs, then I can maybe get behind this (but there's other
reasons not to do this).

> So option 3 doesn't feel less flexible to me. imo the whole-map-allocator is
> more than we need. Ideally it would be easy to specifiy one single
> allocator for all maps and progs in a set of .c files. Sort-of a bpf package.
> In other words one bpf allocator per bpf "namespace" is more than enough.

_Potentially_. Programs need to know that when they reserved X objects, they'll have
them available at a later time and any sharing with other programs can remove this
property. A _set_ of programs can in theory determine the right prefill levels, but
this is certainly easier to reason about on a per-program basis, given that programs
will run at different rates.

> Program authors shouldn't be creating allocators left and right. All these
> free lists will waste memory.
> btw I've added an extra patch to bpf_mem_alloc series:
> https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/commit/?h=memalloc&id=6a586327a270272780bdad7446259bbe62574db1
> that removes kmem_cache usage.
> Turned out (hindsight 20/20) kmem_cache for each bpf map was a bad idea.
> When free_lists are not shared they will similarly waste memory.
> In user space the C code just does malloc() and the memory is isolated per process.
> Ideally in bpf world the programs would just do:
> bpf_mem_alloc(btf_type_id_local(struct foo));
> without specifying an allocator, but that would require one global allocator
> for all bpf programs in the kernel which is probably not a direction we should go ?

Why does it require a global allocator? For example, you can have each program have
its own internal allocator and with runtime live counts, this API is very achievable.
Once the program unloads, you can drain the freelists, so most allocator memory does
not have to live as long as the longest-lived object from that allocator. In
addition, all allocators can share a global freelist too, so chunks released after
the program unloads get a chance to be reused.

> So the programs have to specify an allocator to use in bpf_mem_alloc(),
> but it should be one for all progs, maps in a bpf-package/set/namespace.
> If it's easy for programs to specify a bunch of allocators, like one for each program,
> or one for each btf_type_id the bpf kernel infra would be required to merge
> these allocators from day one. 

How is having one allocator per program different from having one allocator per set
of programs, with per-program bpf-side freelists? The requirement that some (most?)
programs need deterministic access to their freelists is still there, no matter the
number of allocators. If we fear that the default freelist behavior will waste
memory, then the defaults need to be aggressively conservative, with programs being
able to adjust them.

Besides, if we punt the freelists to bpf, then we get absolutely no control over the
memory usage, which is strictly worse for us (and worse developer experience on top).

> (The profileration of kmem_cache-s in the past
> forced merging of them). By restricting bpf program choices with allocator-per-map
> (this option 3) we're not only making the kernel side to do less work
> (no run-time ref counts, no merging is required today), we're also pushing
> bpf progs to use memory concious choices.

This is conflating "there needs to be a limit on memory stuck in freelists" with "you
can only store kptrs from one allocator in each map." The former practically
advocates for freelists to _not_ be hand-rolled inside bpf progs. I still disagree
with the latter - it's coming strictly from the desire to have static mappings
between object storage and allocators; it's not coming from a memory usage need, it
only avoids runtime live object counts.

> Having said all that maybe one global allocator is not such a bad idea.

It _is_ a bad idea because it doesn't have freelist usage determinism. I do, however,
think there is value in having precise and conservative freelist policies, along with
a global freelist for overflow and draining after program unload. The latter would
allow us to share memory between allocators without sacrificing per-allocator
freelist determinism, especially if paired with very static (but configurable)
freelist thresholds.

-- Delyan