Re: [PATCH bpf-next v2 1/2] bpf, x64: Fix tailcall hierarchy

On Mon, Feb 26, 2024 at 7:32 AM Leon Hwang <hffilwlqm@xxxxxxxxx> wrote:
>
>
>
> On 2024/2/24 00:35, Alexei Starovoitov wrote:
> > On Fri, Feb 23, 2024 at 7:30 AM Leon Hwang <hffilwlqm@xxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On 2024/2/23 12:06, Pu Lehui wrote:
> >>>
> >>>
> >>> On 2024/2/22 16:52, Leon Hwang wrote:
> >>
> >> [SNIP]
> >>
> >>>>   }
> >>>>   @@ -575,6 +574,54 @@ static void emit_return(u8 **pprog, u8 *ip)
> >>>>       *pprog = prog;
> >>>>   }
> >>>>   +DEFINE_PER_CPU(u32, bpf_tail_call_cnt);
> >>>
> >>> Hi Leon, this solution really simplifies the complexity. If I
> >>> understand correctly, this TAIL_CALL_CNT becomes global to the whole
> >>> system, not global to the prog, whereas before it was limiting the
> >>> TCC of the entry prog.
> >>>
> >>
> >> Correct. It becomes a PERCPU global variable.
> >>
> >> But, I think this solution is not robust enough.
> >>
> >> For example,
> >>
> >>        prog1            prog1
> >> time ==========================>
> >>               prog2
> >>
> >> This is a timeline on one CPU. If both prog1 and prog2 have tail
> >> calls to run, prog2 will reset the tail_call_cnt on the current CPU,
> >> which is still in use by prog1. As a result, when the CPU schedules
> >> back from prog2 to prog1, the tail_call_cnt on the current CPU has
> >> been reset to 0, no matter how far prog1 had incremented it.
> >>
> >> The tail_call_cnt reset issue happens as well even if the PERCPU
> >> tail_call_cnt moves to 'struct bpf_prog_aux', since one kprobe bpf
> >> prog can be triggered on many functions, e.g. cilium/pwru. However,
> >> that move would still be better than this solution.
> >
> > kprobe progs are not preemptable.
> > There is bpf_prog_active that disallows any recursion.
> > Moving this percpu count to prog->aux should solve it.
> >
> >> I think my previous POC of 'struct bpf_prog_run_ctx' would be
> >> better. I'll resend it later, with some improvements.
> >
> > The percpu approach is still preferred, since it removes the rax mess.
>
> It seems that we cannot remove rax.
>
> Let's take a look at the tailcall3.c selftest:
>
> struct {
>         __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
>         __uint(max_entries, 1);
>         __uint(key_size, sizeof(__u32));
>         __uint(value_size, sizeof(__u32));
> } jmp_table SEC(".maps");
>
> int count = 0;
>
> SEC("tc")
> int classifier_0(struct __sk_buff *skb)
> {
>         count++;
>         bpf_tail_call_static(skb, &jmp_table, 0);
>         return 1;
> }
>
> SEC("tc")
> int entry(struct __sk_buff *skb)
> {
>         bpf_tail_call_static(skb, &jmp_table, 0);
>         return 0;
> }
>
> Here, classifier_0 is populated into the jmp_table.
>
> Then, in classifier_0's prologue, when we 'mov rax,
> classifier_0->tail_call_cnt' in order to use the PERCPU tail_call_cnt
> in 'struct bpf_prog' for the current run, the selftests fail. That is
> because this tail_call_cnt is not the one from the entry bpf prog. The
> tail_call_cnt from the entry bpf prog is the expected one, even while
> the classifier_0 bpf prog is running. (It seems unnecessary to provide
> the diff of the exclusive approach with PERCPU tail_call_cnt.)

Not following.
With percpu tail_call_cnt, does classifier_0 loop forever? I doubt it.
You mean the expected 'count' value is different?
The test expected 33 and instead it's ... what?
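To make the per-prog percpu idea concrete, roughly like this (a sketch
only; the aux field and helper names below are hypothetical, not from
the posted patch, and the JIT wiring is omitted):

#include <linux/bpf.h>
#include <linux/percpu.h>

/* assume 'struct bpf_prog_aux' grows a hypothetical member:
 *     u32 __percpu *tail_call_cnt;
 * freed with free_percpu() when the prog is freed (omitted here).
 */
static int bpf_prog_alloc_tail_call_cnt(struct bpf_prog *prog)
{
	prog->aux->tail_call_cnt = alloc_percpu_gfp(u32, GFP_KERNEL);
	if (!prog->aux->tail_call_cnt)
		return -ENOMEM;
	return 0;
}

/* run-time increment on the current CPU, as the emitted tailcall
 * path would do; each prog has its own per-CPU count, which is why
 * classifier_0's count is not the entry prog's count.
 */
static u32 bpf_prog_inc_tail_call_cnt(const struct bpf_prog *prog)
{
	return this_cpu_inc_return(*prog->aux->tail_call_cnt);
}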

> +               if (tail_call_reachable && !is_subprog) {
> +                       /* mov rax, entry->tail_call_cnt */
> +                       EMIT2_off64(0x48, 0xB8, (u64) entry->tail_call_cnt);
> +                       /* call bpf_tail_call_cnt_prepare */
> +                       emit_call(&prog, bpf_tail_call_cnt_prepare,
> +                                 ip + (prog - start));
> +               } else {
>                         /* Keep the same instruction layout. */
> -                       EMIT2(0x66, 0x90); /* nop2 */
> +                       emit_nops(&prog, 10 + X86_PATCH_SIZE);

As mentioned before... such a "fix" is not acceptable.
We will not be penalizing all progs this way.

How about we make the percpu tail_call_cnt per prog_array map,
then remove rax as this patch does,
but instead of zeroing the tcc on entry, zero it on exit?
While processing bpf_exit, add:
if (tail_call_reachable)
  emit_call(&prog, bpf_tail_call_cnt_prepare, ...)

If a prog that tail calls gets preempted on this cpu and
another prog that also tail calls starts, it won't zero the count.
This way we can remove the nop5 from the prologue too.

The preempted prog will eventually zero the tcc on exit,
and an earlier prog that uses the same prog_array can tail call more
than 32 times, but that cannot be abused reliably,
since preemption is non-deterministic.
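
Roughly, as a sketch (names are hypothetical and the per-map field does
not exist today; assume prog_array_map_alloc() also allocates a per-CPU
u32 for the map; the helper is the same call as the
bpf_tail_call_cnt_prepare above, just named _reset since it runs on
exit):

/* hypothetical: struct bpf_array gains a "u32 __percpu *tail_call_cnt"
 * allocated with alloc_percpu() when the prog_array map is created.
 */
static void bpf_tail_call_cnt_reset(u32 __percpu *tcc)
{
	/* zero only this CPU's slot; other CPUs' counts are untouched */
	this_cpu_write(*tcc, 0);
}

/* x86 JIT, while processing BPF_EXIT of a tail-call-reachable prog
 * (exact ip bookkeeping omitted; how the JIT picks which prog_array's
 * counter to reset is hand-waved here):
 */
if (bpf_prog->aux->tail_call_reachable) {
	/* movabs rdi, imm64 = address of the map's per-CPU counter,
	 * passed as the first argument, like the posted prologue does
	 */
	EMIT2_off64(0x48, 0xBF, (u64)(unsigned long)array->tail_call_cnt);
	emit_call(&prog, bpf_tail_call_cnt_reset, image + addrs[i]);
}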




