On 11/5/2024 1:53 AM, Alexei Starovoitov wrote:
On Mon, Nov 4, 2024 at 3:55 AM Xu Kuohai <xukuohai@xxxxxxxxxxxxxxx> wrote:
 		*(unsigned long *)(udata + moff) = prog->aux->id;
+
+		/* init ksym for this trampoline */
+		bpf_struct_ops_ksym_init(prog, image + trampoline_start,
+					 image_off - trampoline_start,
+					 ksym++);
Thanks for the patch.
I think it's overkill to add ksym for each callsite within a single
trampoline.
1. The prog name will be next in the stack. No need to duplicate it.
2. ksym-ing callsites this way is quite unusual.
3. consider an irq on other insns within a trampoline.
The unwinder won't find anything in such a case.
So I suggest adding only one ksym that covers the whole trampoline.
The name could be "bpf_trampoline_structopsname",
where structopsname is probably st_ops_desc->type.
IIUC, the "whole trampoline" for a struct_ops is actually the page
array st_map->image_pages[MAX_TRAMP_IMAGE_PAGES], where each page is
allocated by arch_alloc_bpf_trampoline(PAGE_SIZE).
Since the virtual addresses of these pages are *NOT* guaranteed to
be contiguous, I don't think we can create a single ksym for them.
And if we add a ksym for each individual page, it seems we will end
up with an odd name for each ksym.
I see. Good point. Ok. Let's add ksym for each callback.
Given that each page consists of one or more bpf trampolines, which
are no different from bpf trampolines for other prog types, such as
bpf trampolines for fentry, and since each bpf trampoline for other
prog types already has a ksym, I don't think it is unusual to add a
ksym for each single bpf trampoline in the page.
And since there are no instructions between adjacent bpf trampolines
within a page, there is nothing between two trampolines that could be
interrupted.
For the name, bpf_trampoline_<struct_ops_name>_<member_name>, like
bpf_trampoline_tcp_congestion_ops_pkts_acked, seems appropriate.
Agree. This naming convention makes sense.
I'd only shorten the prefix to 'bpf_tramp_' or even 'bpf__'
(with double underscore).
It's kinda obvious that it's a trampoline and it's an implementation
detail that doesn't need to be present in the name.
OK, 'bpf__' looks great.
 		}

 	if (st_ops->validate) {
@@ -790,6 +829,8 @@ static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 unlock:
 	kfree(tlinks);
 	mutex_unlock(&st_map->lock);
+	if (!err)
+		bpf_struct_ops_map_ksyms_add(st_map);
 	return err;
 }
@@ -883,6 +924,10 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	 */
 	synchronize_rcu_mult(call_rcu, call_rcu_tasks);

+	/* no trampoline in the map is running anymore, delete symbols */
+	bpf_struct_ops_map_ksyms_del(st_map);
+	synchronize_rcu();
+
This is substantial overhead. Why?
synchronize_rcu_mult() is right above.
I think we should ensure that no trampoline is running, or could run,
before its ksym is deleted from the symbol table. If this order is not
ensured, a trampoline can be interrupted by a perf irq after its symbol
is deleted, resulting in a broken stacktrace, since the trampoline symbol
could not be found by the perf irq handler.
This patch deletes ksyms after synchronize_rcu_mult() to ensure this order.
But the overhead is prohibitive. We have had broken stacks with st_ops
for a long time, so it may still hit the 0.001% case where st_ops are
being switched, as the comment in bpf_struct_ops_map_free() explains.
Got it.
As a separate clean up I would switch the freeing to call_rcu_tasks.
Synchronous waiting is expensive.
Martin,
any suggestions?