Re: [PATCH RFC bpf-next 0/3] Execution context callbacks

On Tue, Jul 19, 2022 at 10:12:57PM +0000, Delyan Kratunov wrote:
> On Tue, 2022-07-19 at 12:02 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 15, 2022 at 06:28:20PM +0000, Delyan Kratunov wrote:
> > > On Thu, 2022-07-14 at 18:51 -0700, Alexei Starovoitov wrote:
> > > > On Tue, Jul 12, 2022 at 06:42:52PM +0000, Delyan Kratunov wrote:
> > > > > 
> > > > > > but have you thought of maybe initially supporting something like:
> > > > > > 
> > > > > > bpf_timer_init(&timer, map, SOME_NEW_DEFERRED_NMI_ONLY_FLAG);
> > > > > > bpf_timer_set_callback(&timer, cg);
> > > > > > bpf_timer_start(&timer, 0, 0);
> > > > > > 
> > > > > > If you init a timer with that special flag, I'm assuming you can have
> > > > > > special cases in the existing helpers to simulate the delayed work?
> > > > > 
> > > > > Potentially but I have some reservations about drawing this equivalence.
> > > > 
> > > > hrtimer api has various flags: soft vs hard irq, pinned and not.
> > > > So the suggestion to treat the irq_work callback as a special timer flag
> > > > actually fits well.
> > > > 
> > > > bpf_timer_init + set_callback + start can be a static inline function
> > > > named bpf_work_submit() in bpf_helpers.h
> > > > (or some new file that will mark the beginning of a libc-bpf library).
> > > > Reusing struct bpf_timer and adding a zero-delay callback could probably be
> > > > easier for users to learn and consume.
> > > 
> > > To clarify, we're talking about 1) making bpf_timer nmi-safe for _some_ but not all
> > > combinations of parameters and 2) adding new flags to specify an execution context?
> > > It's achievable but it's hard to see how it's the superior solution here.
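
fwiw the wrapper could look roughly like this (BPF_F_TIMER_IRQ_WORK is a
made-up flag here, and the timer would still have to live in a map value):

static inline long bpf_work_submit(struct bpf_timer *timer, void *map,
                                   void *callback_fn)
{
  long err;

  /* hypothetical flag: run the callback from irq_work instead of hrtimer */
  err = bpf_timer_init(timer, map, BPF_F_TIMER_IRQ_WORK);
  if (err)
    return err;
  err = bpf_timer_set_callback(timer, callback_fn);
  if (err)
    return err;
  /* zero delay: fire as soon as the target context allows */
  return bpf_timer_start(timer, 0, 0);
}
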
> > > 
> > > > 
> > > > Separately:
> > > > +struct bpf_delayed_work {
> > > > +       __u64 :64;
> > > > +       __u64 :64;
> > > > +       __u64 :64;
> > > > +       __u64 :64;
> > > > +       __u64 :64;
> > > > +} __attribute__((aligned(8)));
> > > > is not extensible.
> > > > It would be better to add indirection to allow the kernel side to grow
> > > > independently of the amount of space consumed in a map value.
> > > 
> > > Fair point, I was wondering what to do with it - storing just a pointer sounds
> > > reasonable.
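
Right. The part embedded in the map value could then shrink to a single
opaque slot, e.g. (illustrative only):

struct bpf_delayed_work {
  __u64 :64; /* kernel-owned pointer to the real work state */
} __attribute__((aligned(8)));
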
> > > 
> > > > Can you think of a way to make irq_work/sleepable callback independent of maps?
> > > > Assume bpf_mem_alloc is already available and NMI prog can allocate a typed object.
> > > > The usage could be:
> > > > struct my_work {
> > > >   int a;
> > > >   struct task_struct __kptr_ref *t;
> > > > };
> > > > void my_cb(struct my_work *w);
> > > > 
> > > > struct my_work *w = bpf_mem_alloc(allocator, bpf_core_type_id_local(*w));
> > > > w->t = ..;
> > > > bpf_submit_work(w, my_cb, SLEEPABLE | IRQ_WORK);
> > > > 
> > > > Am I day dreaming? :)
> > > 
> > > Nothing wrong with dreaming of a better future :) 
> > > 
> > > (I'm assuming you're thinking of bpf_mem_alloc being fronted by the allocator you
> > > recently sent to the list.)
> > > 
> > > On a first pass, here are my concerns:
> > > 
> > > A program and its maps can guarantee a certain amount of storage for work items.
> > > Sizing that storage is difficult but it is yours alone to use. The freelist allocator
> > > can be transiently drained by other programs and starve you of this utility. This is
> > > a new failure mode, so it's worth talking about.
> > 
> > That would be an issue only when progs deliberately share the allocator.
> > In this stmt:
> > struct my_work *w = bpf_mem_alloc(allocator, bpf_core_type_id_local(*w));
> > The 'allocator' can be unique for each prog or shared across a few progs in the same .c file.
> > I wasn't planning to support one global allocator.
> > Just like one global hash map doesn't quite make sense.
> > The user has to create an allocator first, get it connected with memcg,
> > and use the explicit one in their bpf progs/maps.
> > 
> > > With a generic allocator mechanism, we'll have a hard time enforcing the can't-load-
> > > or-store-into-special-fields logic. I like that guardrail and I'm not sure how we'd
> > > achieve the same guarantees. (In your snippet, we don't have the llist_node on the
> > > work item - do we wrap my_work into something else internally? That would hide the
> > > fields that need protecting at the expense of an extra bpf_mem_alloc allocation.)
> > 
> > bpf_mem_alloc will return a referenced PTR_TO_BTF_ID.
> > Every field in this structure is typed. So it's trivial for the verifier to make
> > some of them read-only or not accessible at all.
> > 'struct my_work' can have an explicit struct bpf_delayed_work field. Example:
> > struct my_work {
> >   struct bpf_delayed_work work; // not accessible by prog
> >   int a; // scalar read/write
> >   struct task_struct __kptr_ref *t;  // kptr semantics
> > };
> 
> Sure, anything is possible, it's just more complexity and these checks are not
> exactly easy to follow right now. 
> 
> Alternatively, we could do the classic allocator thing and allocate accounting space
> before the pointer we return. Some magic flag could then expand the space enough to
> use for submit_work. Some allocations would be bumped to a higher bucket but that's
> okay because it would be consistent overhead for those allocation sites.
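
(i.e. something like the layout below, with a hidden header in front of the
pointer the program gets back; all names made up)

struct mem_alloc_hdr {
  struct llist_node node; /* freelist / submit_work queueing */
};

static void *mem_alloc_with_hdr(size_t size, gfp_t gfp)
{
  struct mem_alloc_hdr *hdr = kmalloc(sizeof(*hdr) + size, gfp);

  /* the program only ever sees the memory after the header */
  return hdr ? hdr + 1 : NULL;
}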

Technically we can, but that would be a departure from what we already do.
bpf_spin_lock, bpf_timer, __kptr are a normal part of structs with different access
restrictions. 'struct bpf_delayed_work' shouldn't be any different.

Another approach would be to let the bpf prog allocate 'struct my_work' without
any special fields, then use an nmi-safe allocator inside bpf_submit_work, hide
it completely from the bpf side, and auto-free it after the callback is done.
But the extra alloc is a performance hit and overall it would be an unusual hack.

Maybe we can allow bpf_submit_work() to work both with a referenced ptr_to_btf_id
like above and with a normal map value similar to what you've implemented?
We would need to somehow make sure that the container_of() operation to cast from
&work to either the allocated ptr_to_btf_id or the map value works in both cases.
That would be the most flexible solution and would resemble kernel programming
style the best.
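
For the map value side it could look roughly like this (bpf_submit_work, its
flag, and whether the callback receives &work or the container are all
hypothetical at this point):

struct map_elem {
  int counter;
  struct bpf_delayed_work work;
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_elem);
} array SEC(".maps");

void elem_cb(struct bpf_delayed_work *work)
{
  /* same container_of() cast whether 'work' sits in a map value
   * or in an allocated ptr_to_btf_id object
   */
  struct map_elem *e = container_of(work, struct map_elem, work);

  e->counter++;
}

void bpf_prog(...)
{
  int key = 0;
  struct map_elem *e = bpf_map_lookup_elem(&array, &key);

  if (e)
    bpf_submit_work(&e->work, elem_cb, USE_IRQ_WORK);
}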

> > 
> > > Managing the storage returned from bpf_mem_alloc is of course also a concern. We'd
> > > need to treat bpf_submit_work as "releasing" it (really, taking ownership). This path
> > > means more lifecycle analysis in the verifier and explicit and implicit free()s.
> > 
> > What is the actual concern?
> > bpf_submit_work will have clear "release" semantics. The verifier already supports it.
> > The 'my_cb' callback will receive a referenced PTR_TO_BTF_ID as well and would
> > have to release it with bpf_mem_free(ma, w).
> > Here is more complete proposal:
> > 
> > struct {
> >         __uint(type, BPF_MEM_ALLOC);
> > } allocator SEC(".maps");
> 
> I like this, so long as we pre-allocate enough to submit more sleepable work
> immediately - the first work item the program submits could then prefill more items.
> 
> For an even better experience, it would be great if we could specify in the map
> definition the number of items of size X we'll need. If we give that lever to the
> developer, they can then use it so they never have to orchestrate sleepable work to
> call bpf_mem_prealloc explicitly.

Agree. That's the idea. Will work on it.
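
Something like this, presumably (the prefill fields are made up):

struct {
        __uint(type, BPF_MEM_ALLOC);
        __uint(max_entries, 256);      /* number of objects to prefill at load */
        __type(value, struct my_work); /* size of each prefilled object */
} allocator SEC(".maps");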

> 
> > 
> > struct my_work {
> >   struct bpf_delayed_work work;
> >   int a;
> >   struct task_struct __kptr_ref *t;
> > };
> > 
> > void my_cb(struct my_work *w)
> > {
> >   // access w
> >   bpf_mem_free(&allocator, w);
> > }
> > 
> > void bpf_prog(...)
> > {
> >   struct my_work *w = bpf_mem_alloc(&allocator, bpf_core_type_id_local(*w));
> >   w->t = ..;
> >   bpf_submit_work(w, my_cb, USE_IRQ_WORK);
> > }
> > 
> > > I'm not opposed to it overall - the developer experience is very familiar - but I am
> > > primarily worried that allocator failures will be in the same category of issues as
> > > the hash map collisions for stacks. If you want reliability, you just don't use that
> > > type of map - what's the alternative in this hypothetical bpf_mem_alloc future?
> > 
> > Reliability of allocation is certainly necessary.
> > bpf_mem_alloc will have an ability to _synchronously_ preallocate into freelist
> > from sleepable context, so bpf prog will have full control of that free list.
> 
> I think having the map initialized and prefilled on load and having sleepable work
> from the first version of this mechanism becomes a requirement of this design. Having
> the prefill requirements (number of items and size) on the map definition removes the
> requirement to have sleepable work from day one.

I'm not sure why 'sleepable' is a requirement. irq_work will be able to do
synchronous prefill with GFP_NOWAIT. sleepable callback will be able to do
synchronous prefill with GFP_KERNEL. There is a difference, of course,
but it's not a blocker.
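
The prog-facing side can stay the same in both cases, e.g.
(bpf_mem_prealloc being the hypothetical explicit helper mentioned above):

void my_cb(struct my_work *w)
{
  /* ... consume w ... */
  bpf_mem_free(&allocator, w);

  /* top the freelist back up; the kernel would use GFP_NOWAIT when this
   * runs as irq_work and GFP_KERNEL when it runs as a sleepable callback
   */
  bpf_mem_prealloc(&allocator, 1);
}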

> How do you want to sequence this? Do you plan to do the work to expose bpf_mem_alloc
> to programs as part of the initial series or as a later followup? 

Currently thinking of it as a follow up.
If you have cycles maybe you can help?
bpf_mem_alloc/free internals are tested and usable already. Prefill is not implemented yet.
But the work to do the bpf_mem_alloc helper and to expose the allocator as a special kind of map
can start already.


