On Mon, Nov 4, 2024 at 3:55 AM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote:
>
> >>                 *(unsigned long *)(udata + moff) = prog->aux->id;
> >> +
> >> +               /* init ksym for this trampoline */
> >> +               bpf_struct_ops_ksym_init(prog, image + trampoline_start,
> >> +                                        image_off - trampoline_start,
> >> +                                        ksym++);
> >
> > Thanks for the patch.
> > I think it's overkill to add a ksym for each callsite within a single
> > trampoline.
> > 1. The prog name will be next in the stack. No need to duplicate it.
> > 2. ksym-ing callsites this way is quite unusual.
> > 3. Consider an irq on other insns within a trampoline.
> >    The unwinder won't find anything in such a case.
> >
> > So I suggest adding only one ksym that covers the whole trampoline.
> > The name could be "bpf_trampoline_structopsname",
> > which is probably st_ops_desc->type.
>
> IIUC, the "whole trampoline" for a struct_ops is actually the page
> array st_map->image_pages[MAX_TRAMP_IMAGE_PAGES], where each page is
> allocated by arch_alloc_bpf_trampoline(PAGE_SIZE).
>
> Since the virtual addresses of these pages are *NOT* guaranteed to
> be contiguous, I don't think we can create a single ksym for them.
>
> And if we add a ksym for each individual page, it seems we will end
> up with an odd name for each ksym.

I see. Good point. Ok. Let's add a ksym for each callback.

> Given that each page consists of one or more bpf trampolines, which
> are no different from bpf trampolines for other prog types, such as
> bpf trampolines for fentry, and since each bpf trampoline for other
> prog types already has a ksym, I think it is not unusual to add a
> ksym for each single bpf trampoline in the page.
>
> And, since there are no instructions between adjacent bpf trampolines
> within a page, nothing between two trampolines can be interrupted.
>
> For the name, bpf_trampoline_<struct_ops_name>_<member_name>, like
> bpf_trampoline_tcp_congestion_ops_pkts_acked, seems appropriate.

Agree. This naming convention makes sense.
I'd only shorten the prefix to 'bpf_tramp_' or even 'bpf__' (with double
underscore). It's kinda obvious that it's a trampoline, and it's an
implementation detail that doesn't need to be present in the name.

> >>         }
> >>
> >>         if (st_ops->validate) {
> >> @@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> >>  unlock:
> >>         kfree(tlinks);
> >>         mutex_unlock(&st_map->lock);
> >> +       if (!err)
> >> +               bpf_struct_ops_map_ksyms_add(st_map);
> >>         return err;
> >>  }
> >>
> >> @@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
> >>          */
> >>         synchronize_rcu_mult(call_rcu, call_rcu_tasks);
> >>
> >> +       /* no trampoline in the map is running anymore, delete symbols */
> >> +       bpf_struct_ops_map_ksyms_del(st_map);
> >> +       synchronize_rcu();
> >> +
> >
> > This is substantial overhead, and why?
> > synchronize_rcu_mult() is right above.
>
> I think we should ensure no trampoline is running or could run before
> its ksym is deleted from the symbol table. If this order is not
> ensured, a trampoline can be interrupted by a perf irq after its
> symbol is deleted, resulting in a broken stacktrace, since the
> trampoline symbol could not be found by the perf irq handler.
>
> This patch deletes ksyms after synchronize_rcu_mult() to ensure this
> order.

But the overhead is prohibitive. We had broken stacks with st_ops for a
long time, so it may still hit the 0.001% case where st_ops are being
switched, as the comment in bpf_struct_ops_map_free() explains.

As a separate cleanup, I would switch the freeing to call_rcu_tasks.
Synchronous waiting is expensive.

Martin, any suggestions?