On Fri, Jul 17, 2020 at 01:06:07AM +0200, Daniel Borkmann wrote:
> > +			ret = bpf_arch_text_poke(poke->tailcall_bypass,
> > +						 BPF_MOD_JUMP,
> > +						 NULL, bypass_addr);
> > +			BUG_ON(ret < 0 && ret != -EINVAL);
> > +			/* let other CPUs finish the execution of program
> > +			 * so that it will not possible to expose them
> > +			 * to invalid nop, stack unwind, nop state
> > +			 */
> > +			synchronize_rcu();
>
> Very heavyweight that we need to potentially call this /multiple/ times for just a
> /single/ map update under poke mutex even ... but agree it's needed here to avoid
> racing. :(

Yeah. I wasn't clear with my suggestion earlier.
I meant to say that synchronize_rcu() can be done between two loops
(a slightly fuller, untested sketch is at the end of this mail):

list_for_each_entry(elem, &aux->poke_progs, list)
	for (i = 0; i < elem->aux->size_poke_tab; i++)
		bpf_arch_text_poke(poke->tailcall_bypass, ...

synchronize_rcu();

list_for_each_entry(elem, &aux->poke_progs, list)
	for (i = 0; i < elem->aux->size_poke_tab; i++)
		bpf_arch_text_poke(poke->tailcall_target, ...

Not sure how much better it will be though. text_poke is heavy.
I think it's heavier than synchronize_rcu().
Long term we can batch the text_poke-s.

I'm actually fine with the above approach of synchronize_rcu() without splitting the loop.
This kind of optimization can be done later as a follow-up.
I'd really like to land this stuff in this bpf-next cycle.
It's a big improvement to tail_calls and bpf2bpf calls.
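
For reference, a fuller version of the split-loop idea above. This is only a
sketch, not a tested patch: it assumes it sits in the same function as the
quoted hunk, with elem, poke, i, ret, aux, map and key declared as in the
series, and with bypass_addr / old_addr / new_addr standing in for whatever
per-descriptor address computation the patch already does there:

	/* pass 1: divert execution away from every affected tail call site */
	list_for_each_entry(elem, &aux->poke_progs, list) {
		for (i = 0; i < elem->aux->size_poke_tab; i++) {
			poke = &elem->aux->poke_tab[i];
			if (poke->tail_call.map != map ||
			    poke->tail_call.key != key)
				continue;
			ret = bpf_arch_text_poke(poke->tailcall_bypass,
						 BPF_MOD_JUMP,
						 NULL, bypass_addr);
			BUG_ON(ret < 0 && ret != -EINVAL);
		}
	}

	/* one grace period for all descriptors instead of one per poke */
	synchronize_rcu();

	/* pass 2: now it is safe to patch the actual tail call targets */
	list_for_each_entry(elem, &aux->poke_progs, list) {
		for (i = 0; i < elem->aux->size_poke_tab; i++) {
			poke = &elem->aux->poke_tab[i];
			if (poke->tail_call.map != map ||
			    poke->tail_call.key != key)
				continue;
			ret = bpf_arch_text_poke(poke->tailcall_target,
						 BPF_MOD_JUMP,
						 old_addr, new_addr);
			BUG_ON(ret < 0 && ret != -EINVAL);
		}
	}

The win would only be collapsing N synchronize_rcu() calls into one per map
update; the number of bpf_arch_text_poke() calls stays the same, which is why
I don't think it's worth holding the series for it.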