On Tue, Jul 19, 2022 at 10:12:57PM +0000, Delyan Kratunov wrote:
> On Tue, 2022-07-19 at 12:02 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 15, 2022 at 06:28:20PM +0000, Delyan Kratunov wrote:
> > > On Thu, 2022-07-14 at 18:51 -0700, Alexei Starovoitov wrote:
> > > > On Tue, Jul 12, 2022 at 06:42:52PM +0000, Delyan Kratunov wrote:
> > > > >
> > > > > > but have you thought of maybe initially supporting something like:
> > > > > >
> > > > > > bpf_timer_init(&timer, map, SOME_NEW_DEFERRED_NMI_ONLY_FLAG);
> > > > > > bpf_timer_set_callback(&timer, cg);
> > > > > > bpf_timer_start(&timer, 0, 0);
> > > > > >
> > > > > > If you init a timer with that special flag, I'm assuming you can have
> > > > > > special cases in the existing helpers to simulate the delayed work?
> > > > >
> > > > > Potentially, but I have some reservations about drawing this equivalence.
> > > >
> > > > The hrtimer api has various flags: soft vs hard irq, pinned and not.
> > > > So the suggestion to treat the irq_work callback as a special timer flag
> > > > actually fits well.
> > > >
> > > > bpf_timer_init + set_callback + start can be a static inline function
> > > > named bpf_work_submit() in bpf_helpers.h
> > > > (or some new file that will mark the beginning of a libc-bpf library).
> > > > Reusing struct bpf_timer and adding a zero-delay callback could probably be
> > > > easier for users to learn and consume.
> > >
> > > To clarify, we're talking about 1) making bpf_timer nmi-safe for _some_ but
> > > not all combinations of parameters and 2) adding new flags to specify an
> > > execution context? It's achievable, but it's hard to see how it's the
> > > superior solution here.
> > >
> > > > Separately:
> > > > +struct bpf_delayed_work {
> > > > +	__u64 :64;
> > > > +	__u64 :64;
> > > > +	__u64 :64;
> > > > +	__u64 :64;
> > > > +	__u64 :64;
> > > > +} __attribute__((aligned(8)));
> > > > is not extensible.
> > > > It would be better to add indirection to allow the kernel side to grow
> > > > independently from the amount of space consumed in a map value.
> > >
> > > Fair point, I was wondering what to do with it - storing just a pointer
> > > sounds reasonable.
> > >
> > > > Can you think of a way to make the irq_work/sleepable callback independent
> > > > of maps? Assume bpf_mem_alloc is already available and an NMI prog can
> > > > allocate a typed object. The usage could be:
> > > > struct my_work {
> > > >   int a;
> > > >   struct task_struct __kptr_ref *t;
> > > > };
> > > > void my_cb(struct my_work *w);
> > > >
> > > > struct my_work *w = bpf_mem_alloc(allocator, bpf_core_type_id_local(*w));
> > > > w->t = ..;
> > > > bpf_submit_work(w, my_cb, SLEEPABLE | IRQ_WORK);
> > > >
> > > > Am I daydreaming? :)
> > >
> > > Nothing wrong with dreaming of a better future :)
> > >
> > > (I'm assuming you're thinking of bpf_mem_alloc being fronted by the
> > > allocator you recently sent to the list.)
> > >
> > > On a first pass, here are my concerns:
> > >
> > > A program and its maps can guarantee a certain amount of storage for work
> > > items. Sizing that storage is difficult, but it is yours alone to use. The
> > > freelist allocator can be transiently drained by other programs and starve
> > > you of this utility. This is a new failure mode, so it's worth talking about.
> >
> > That would be an issue only when progs deliberately share the allocator.
> > In this stmt:
> > struct my_work *w = bpf_mem_alloc(allocator, bpf_core_type_id_local(*w));
> > the 'allocator' can be unique for each prog or shared across a few progs in
> > the same .c file.
> > I wasn't planning to support one global allocator.
> > Just like one global hash map doesn't quite make sense.
> > The user has to create an allocator first, get it connected with memcg,
> > and use the explicit one in their bpf progs/maps.
> >
> > > With a generic allocator mechanism, we'll have a hard time enforcing the
> > > can't-load-or-store-into-special-fields logic. I like that guardrail and
> > > I'm not sure how we'd achieve the same guarantees. (In your snippet, we
> > > don't have the llist_node on the work item - do we wrap my_work into
> > > something else internally? That would hide the fields that need protecting
> > > at the expense of an extra bpf_mem_alloc allocation.)
> >
> > bpf_mem_alloc will return a referenced PTR_TO_BTF_ID.
> > Every field in this structure is typed. So it's trivial for the verifier to
> > make some of them read only or not accessible at all.
> > 'struct my_work' can have an explicit struct bpf_delayed_work field. Example:
> > struct my_work {
> >   struct bpf_delayed_work work; // not accessible by prog
> >   int a; // scalar read/write
> >   struct task_struct __kptr_ref *t; // kptr semantics
> > };
>
> Sure, anything is possible, it's just more complexity and these checks are not
> exactly easy to follow right now.
>
> Alternatively, we could do the classic allocator thing and allocate accounting
> space before the pointer we return. Some magic flag could then expand the space
> enough to use for submit_work. Some allocations would be bumped to a higher
> bucket, but that's okay because it would be consistent overhead for those
> allocation sites.

Technically we can, but that would be a departure from what we already do.
bpf_spin_lock, bpf_timer, __kptr are a normal part of structs with different
access restrictions. 'struct bpf_delayed_work' shouldn't be any different.

Another approach would be to let the bpf prog allocate 'struct my_work' without
any special fields, then use an nmi-safe allocator inside bpf_submit_work, hide
it completely from the bpf side, and auto-free it after the callback is done.
But the extra alloc is a performance hit and overall it would be an unusual hack.

Maybe we can allow bpf_submit_work() to work with a referenced ptr_to_btf_id
like above and with a normal map value similar to what you've implemented?
We would need to somehow make sure that the container_of() operation to cast
from &work either to the allocated ptr_to_btf_id or to the map value works in
both cases. That would be the most flexible solution and would resemble kernel
programming style the best.
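To make the dual-path idea concrete, here is a minimal prog-side sketch.
bpf_submit_work(), bpf_mem_alloc(), bpf_mem_free(), BPF_MEM_ALLOC and
USE_IRQ_WORK are all proposed names from this thread, not existing APIs,
and the kernel side is assumed to container_of() from the embedded
bpf_delayed_work back to the enclosing type:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct my_work {
	struct bpf_delayed_work work; /* verifier-protected, prog can't touch */
	int a;
};

struct {
	__uint(type, BPF_MEM_ALLOC); /* proposed allocator-as-map */
} allocator SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct my_work);
} work_map SEC(".maps");

static void my_cb(struct my_work *w)
{
	/* For the allocated path the callback owns 'w' and must free it.
	 * A map-value 'w' would not be freed - distinguishing the two is
	 * exactly the container_of()/verifier question raised above. */
	bpf_mem_free(&allocator, w);
}

SEC("perf_event")
int prog(void *ctx)
{
	__u32 key = 0;

	/* Path 1: work item embedded in a map value. */
	struct my_work *m = bpf_map_lookup_elem(&work_map, &key);
	if (m)
		bpf_submit_work(m, my_cb, USE_IRQ_WORK);

	/* Path 2: work item from the proposed allocator; bpf_submit_work()
	 * "releases" the reference, transferring ownership to the kernel. */
	struct my_work *w = bpf_mem_alloc(&allocator, bpf_core_type_id_local(*w));
	if (w)
		bpf_submit_work(w, my_cb, USE_IRQ_WORK);
	return 0;
}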
> > > Managing the storage returned from bpf_mem_alloc is of course also a
> > > concern. We'd need to treat bpf_submit_work as "releasing" it (really,
> > > taking ownership). This path means more lifecycle analysis in the verifier
> > > and explicit and implicit free()s.
> >
> > What is the actual concern?
> > bpf_submit_work will have clear "release" semantics. The verifier already
> > supports it. The 'my_cb' callback will receive a referenced PTR_TO_BTF_ID as
> > well and would have to release it with bpf_mem_free(ma, w).
> > Here is a more complete proposal:
> >
> > struct {
> >   __uint(type, BPF_MEM_ALLOC);
> > } allocator SEC(".maps");
>
> I like this, so long as we pre-allocate enough to submit more sleepable work
> immediately - the first work item the program submits could then prefill more
> items.
>
> For an even better experience, it would be great if we could specify in the
> map definition the number of items of size X we'll need. If we give that lever
> to the developer, they can then use it so they never have to orchestrate
> sleepable work to call bpf_mem_prealloc explicitly.

Agree. That's the idea. Will work on it.
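To make "that lever" concrete, one possible shape for the declaration - the
prefill fields are purely hypothetical, nothing like this exists yet:

struct {
	__uint(type, BPF_MEM_ALLOC);
	__uint(prefill_items, 256);                        /* hypothetical */
	__uint(prefill_item_size, sizeof(struct my_work)); /* hypothetical */
} allocator SEC(".maps");

Libbpf/kernel would then prefill that many items of that size at load time, so
the first nmi-context allocation can't fail for lack of free elements.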
> > struct my_work {
> >   struct bpf_delayed_work work;
> >   int a;
> >   struct task_struct __kptr_ref *t;
> > };
> >
> > void my_cb(struct my_work *w)
> > {
> >   // access w
> >   bpf_mem_free(&allocator, w);
> > }
> >
> > void bpf_prog(...)
> > {
> >   struct my_work *w = bpf_mem_alloc(&allocator, bpf_core_type_id_local(*w));
> >   w->t = ..;
> >   bpf_submit_work(w, my_cb, USE_IRQ_WORK);
> > }
> >
> > > I'm not opposed to it overall - the developer experience is very familiar
> > > - but I am primarily worried that allocator failures will be in the same
> > > category of issues as the hash map collisions for stacks. If you want
> > > reliability, you just don't use that type of map - what's the alternative
> > > in this hypothetical bpf_mem_alloc future?
> >
> > Reliability of allocation is certainly necessary.
> > bpf_mem_alloc will have the ability to _synchronously_ preallocate into the
> > freelist from a sleepable context, so the bpf prog will have full control of
> > that free list.
>
> I think having the map initialized and prefilled on load and having sleepable
> work from the first version of this mechanism becomes a requirement of this
> design. Having the prefill requirements (number of items and size) on the map
> definition removes the requirement to have sleepable work from day one.

I'm not sure why 'sleepable' is a requirement.
irq_work will be able to do a synchronous prefill with GFP_NOWAIT.
A sleepable callback will be able to do a synchronous prefill with GFP_KERNEL.
There is a difference, of course, but it's not a blocker.

> How do you want to sequence this? Do you plan to do the work to expose
> bpf_mem_alloc to programs as part of the initial series or as a later
> followup?

Currently thinking of it as a follow-up. If you have cycles, maybe you can help?
bpf_mem_alloc/free internals are tested and usable already.
prefill is not implemented yet.
But the work to do the bpf_mem_alloc helper and to expose the allocator as a
special kind of map can start already.
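For what it's worth, a rough sketch of how that synchronous prefill might
eventually look from a sleepable prog - bpf_mem_prealloc() is the name used in
the discussion above, but its signature is invented here for illustration:

SEC("syscall") /* sleepable context */
int refill(void *ctx)
{
	/* Synchronously top up the freelist; GFP_KERNEL from this sleepable
	 * context, while an irq_work callback doing the same call would use
	 * GFP_NOWAIT internally, per the paragraph above. */
	return bpf_mem_prealloc(&allocator, 128, sizeof(struct my_work));
}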