Re: [PATCH RFC bpf-next 0/3] Execution context callbacks

sdf@xxxxxxxxxx · Tue, 12 Jul 2022 11:07:23 -0700

On 07/11, Delyan Kratunov wrote:
BPF developers are sometimes faced with surprising limitations of the  
execution context
their code runs in. NMI is particularly problematic though userspace data  
access as a whole
has come up as well (e.g. build id not being available).

This series adds bpf_delayed_work_submit which takes a callback function  
and a context pointer
and is able to execute the callback from (initially) a hardirq context.

This is an RFC to answer a few questions on direction:

1. Naming is intentionally bad and something I'd like to bikeshed a bit.
"bpf_(defer|submit)_work" was my first instinct but that has workqueue  
connotations in the kernel.

2. The callback arguments need to be in a map. We can currently express  
helper arguments taking a
pointer to a map value but not a pointer to _within_ a map value. Should  
we add a new argument
type or should we just pass the map value pointer to the callback?

Passing map value pointer (as you do in the selftest) seems fine; do
you think we need more flexibility here?

3. A lot of the map handling code is verbatim from bpf_timer. This feels  
icky but I'm not sure if it
justifies a refactor quite yet. Opinions welcome.

+1, it does seem very close to a timer with expiry time == 0.

I don't know what's the exact usecase you're trying to solve exactly, but
have you though of maybe initially supporting something like:

bpf_timer_init(&timer, map, SOME_NEW_DEFERRED_NMI_ONLY_FLAG);
bpf_timer_set_callback(&timer, cg);
bpf_timer_start(&timer, 0, 0);

If you init a timer with that special flag, I'm assuming you can have
special cases in the existing helpers to simulate the delayed work?
Then, the verifier changes should be minimal it seems.

OTOH, having a separate set of helpers seems more clear API-wise :-/

4. This functionality is implemented as a single helper call (no matching  
bpf_delayed_work_init). In practice,
this means that we can't implement the map->usercnt check that  
bpf_timer_start performs to ensure the
map is referenced from userspace. However, given that a) we wait for  
pending work before releasing the
bpf_prog, b) the map will be in the bpf_prog's used_maps, and c) the map  
free path does not need to release
any external resources, and d) the bpf_delayed_work items bump the prog  
refcnt, I think we can keep this mechanism
a single call.

I'd like to get this right from the start, so do let me know if I'm  
missing potential execution
contexts that we can't really wait to drain from the bpf_prog free path.

5. This mechanism generalizes to other contexts (e.g., sleepable context  
on the way back to userspace
a-la set_thread_flag(TIF_UPROBE)), by means of adding the  
bpf_delayed_work items to other llist_heads.
E.g., we can keep the llist_heads in task_local_storage or in per-cpu  
structures. I can't think of
anything that requires a more complicated approach (or reserved space in  
the structs) but do let me
know if I'm wrong.

6. Lastly, the llist approach was dictated by the NMI constraints. RCU  
lists are out because they need
to synchronize_rcu when splicing from one head to another.

Thanks,
Delyan

Delyan Kratunov (3):
   bpf: allow maps to hold bpf_delayed_work fields
   bpf: add delayed_work mechanism
   selftests: delayed_work tests

  include/linux/bpf.h                           |  22 ++-
  include/linux/btf.h                           |   1 +
  include/uapi/linux/bpf.h                      |  36 +++++
  kernel/bpf/btf.c                              |  21 +++
  kernel/bpf/core.c                             |   8 ++
  kernel/bpf/helpers.c                          |  92 ++++++++++++
  kernel/bpf/syscall.c                          |  24 +++-
  kernel/bpf/verifier.c                         | 132 +++++++++++++++++-
  scripts/bpf_doc.py                            |   2 +
  tools/include/uapi/linux/bpf.h                |  35 +++++
  .../selftests/bpf/prog_tests/delayed_work.c   |  29 ++++
  .../selftests/bpf/progs/delayed_irqwork.c     |  59 ++++++++
  12 files changed, 457 insertions(+), 4 deletions(-)
  create mode 100644 tools/testing/selftests/bpf/prog_tests/delayed_work.c
  create mode 100644 tools/testing/selftests/bpf/progs/delayed_irqwork.c

--
2.36.1