On Thu, Sep 22, 2022 at 6:12 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
[...]
> > Instead of adding 4 bytes for enum in patch 3
> > (which will be 8 bytes due to alignment)
> > and abusing bpf_cookie here
> > (which struct_ops bpf prog might eventually read and be surprised
> > to find sk pointer in there)
> > how about adding 'struct task_struct *saved_current' as another arg
> > to bpf_tramp_run_ctx ?
> > Always store the current task in there in prog_entry_struct_ops
> > and then compare it here in this specialized bpf_init_ops_setsockopt?
> >
> > Or maybe always check in enter_prog_struct_ops:
> > if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> >                  run_ctx)->saved_current == current) // goto out since recursion?
> > it will prevent issues in case we don't know about and will
> > address the good recursion case as explained in patch 1?
> > I'm assuming 2nd ssthresh runs in a different task..
> > Or is it actually the same task?
>
> The 2nd ssthresh() should run in the same task but different sk.  The
> first ssthresh(sk[1]) was run in_task() context and then got
> interrupted.  The softirq then handles the rcv path which just happens
> to also call ssthresh(sk[2]) in the unlikely pkt-loss case.  It is like
> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
>
> The tcp-cc ops can recur but cannot recur on the same sk because it
> requires holding the sk lock, so the patch remembers what the previous
> sk was to ensure it does not recur on the same sk.  Then it needs to
> peek into the previous run ctx, which may not always be a
> bpf_tramp_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
> at the previous bpf_run_ctx.
>
> Since struct_ops is the only one that needs to peek into the previous
> run_ctx (through tramp_run_ctx->saved_run_ctx), instead of adding 4
> bytes to the bpf_run_ctx, one idea that just came to my mind is to use
> one bit in the tramp_run_ctx->saved_run_ctx pointer itself.  Something
> like this if it reuses the bpf_cookie (probably missed some int/ptr
> type casting):
>
> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
>
> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
> 					struct bpf_tramp_run_ctx *run_ctx)
> 	__acquires(RCU)
> {
> 	rcu_read_lock();
> 	migrate_disable();
>
> 	run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
> 						 BPF_RUN_CTX_STRUCT_OPS_BIT);
>
> 	return bpf_prog_start_time();
> }
>
> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> 	   int, optname, char *, optval, int, optlen)
> {
> 	/* ... */
> 	if (unlikely((run_ctx->saved_run_ctx &
> 		      BPF_RUN_CTX_STRUCT_OPS_BIT) && ...)) {
> 		/* ... */
> 		if (bpf_cookie == (uintptr_t)sk)
> 			return -EBUSY;
> 	}
>
> }

If I understand correctly, the purpose of adding a field in run_ctx is
to identify the enclosing type of a generic bpf_run_ctx. In the
following lines:

> >> +	if (unlikely(run_ctx->saved_run_ctx &&
> >> +		     run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> >> +		saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;

the enclosing type of run_ctx->saved_run_ctx is a bpf_tramp_run_ctx, so
we can safely cast it and further check sk. The best way I can come up
with is also what Martin suggests: encoding the type information in the
lower bits of the saved_run_ctx field.

Hao
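
For reference, a minimal userspace sketch of the pointer-tagging idea
discussed above: the low bit of the saved run-ctx pointer records that
the previous context was a struct_ops (trampoline) run ctx, so a callee
can safely cast it back and compare the remembered sk. The struct
layout, the helper names tag_struct_ops/untag_struct_ops, and the
stand-in sk value are simplified assumptions for illustration, not the
kernel's actual definitions:

#include <stdint.h>
#include <stdio.h>

#define RUN_CTX_STRUCT_OPS_BIT 1UL

struct run_ctx {		/* stand-in for struct bpf_run_ctx */
	int dummy;
};

struct tramp_run_ctx {		/* stand-in for struct bpf_tramp_run_ctx */
	struct run_ctx run_ctx;		/* must be first for the cast below */
	struct run_ctx *saved_run_ctx;	/* low bit may carry the tag */
	uintptr_t bpf_cookie;		/* remembers the sk pointer */
};

/* Tag a run_ctx pointer as "enclosed in a tramp_run_ctx". */
static struct run_ctx *tag_struct_ops(struct run_ctx *ctx)
{
	return (struct run_ctx *)((uintptr_t)ctx | RUN_CTX_STRUCT_OPS_BIT);
}

/* Return the enclosing tramp_run_ctx if the tag bit is set, else NULL. */
static struct tramp_run_ctx *untag_struct_ops(struct run_ctx *saved)
{
	uintptr_t p = (uintptr_t)saved;

	if (!(p & RUN_CTX_STRUCT_OPS_BIT))
		return NULL;
	return (struct tramp_run_ctx *)(p & ~RUN_CTX_STRUCT_OPS_BIT);
}

int main(void)
{
	struct tramp_run_ctx outer = { .bpf_cookie = 0x1000 /* "sk" */ };
	struct tramp_run_ctx inner;

	/* Entering a nested struct_ops prog: save the tagged previous ctx. */
	inner.saved_run_ctx = tag_struct_ops(&outer.run_ctx);

	/* In the setsockopt helper: peek at the previous ctx safely. */
	struct tramp_run_ctx *prev = untag_struct_ops(inner.saved_run_ctx);

	if (prev && prev->bpf_cookie == 0x1000) {
		printf("recursion on the same sk, would return -EBUSY\n");
		return 1;
	}
	printf("different sk (or no struct_ops caller), proceed\n");
	return 0;
}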