Re: [PATCH bpf-next 4/5] bpf: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 23 Sep 2022 11:27:21 -0700

On Fri, Sep 23, 2022 at 10:46 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 9/23/22 8:26 AM, Alexei Starovoitov wrote:
> > On Thu, Sep 22, 2022 at 6:11 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>
> >> On 9/22/22 5:12 PM, Alexei Starovoitov wrote:
> >>> On Thu, Sep 22, 2022 at 3:56 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> >>>>
> >>>> From: Martin KaFai Lau <martin.lau@xxxxxxxxxx>
> >>>>
> >>>> When a bad bpf prog '.init' calls
> >>>> bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop:
> >>>>
> >>>> .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ...
> >>>> ... => .init => bpf_setsockopt(tcp_cc).
> >>>>
> >>>> It was prevented by the prog->active counter before but the prog->active
> >>>> detection cannot be used in struct_ops as explained in the earlier
> >>>> patch of the set.
> >>>>
> >>>> In this patch, the second bpf_setsockopt(tcp_cc) is not allowed
> >>>> in order to break the loop.  This is done by checking the
> >>>> previous bpf_run_ctx has saved the same sk pointer in the
> >>>> bpf_cookie.
> >>>>
> >>>> Note that this essentially limits only the first '.init' can
> >>>> call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer
> >>>> does not support ECN) and the second '.init' cannot fallback to
> >>>> another cc.  This applies even the second
> >>>> bpf_setsockopt(TCP_CONGESTION) will not cause a loop.
> >>>>
> >>>> Signed-off-by: Martin KaFai Lau <martin.lau@xxxxxxxxxx>
> >>>> ---
> >>>>    include/linux/filter.h |  3 +++
> >>>>    net/core/filter.c      |  4 ++--
> >>>>    net/ipv4/bpf_tcp_ca.c  | 54 ++++++++++++++++++++++++++++++++++++++++++
> >>>>    3 files changed, 59 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
> >>>> index 98e28126c24b..9942ecc68a45 100644
> >>>> --- a/include/linux/filter.h
> >>>> +++ b/include/linux/filter.h
> >>>> @@ -911,6 +911,9 @@ int sk_get_filter(struct sock *sk, sockptr_t optval, unsigned int len);
> >>>>    bool sk_filter_charge(struct sock *sk, struct sk_filter *fp);
> >>>>    void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
> >>>>
> >>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>> +                   char *optval, int optlen);
> >>>> +
> >>>>    u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
> >>>>    #define __bpf_call_base_args \
> >>>>           ((u64 (*)(u64, u64, u64, u64, u64, const struct bpf_insn *)) \
> >>>> diff --git a/net/core/filter.c b/net/core/filter.c
> >>>> index f4cea3ff994a..e56a1ebcf1bc 100644
> >>>> --- a/net/core/filter.c
> >>>> +++ b/net/core/filter.c
> >>>> @@ -5244,8 +5244,8 @@ static int __bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>>           return -EINVAL;
> >>>>    }
> >>>>
> >>>> -static int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>> -                          char *optval, int optlen)
> >>>> +int _bpf_setsockopt(struct sock *sk, int level, int optname,
> >>>> +                   char *optval, int optlen)
> >>>>    {
> >>>>           if (sk_fullsock(sk))
> >>>>                   sock_owned_by_me(sk);
> >>>> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> >>>> index 6da16ae6a962..a9f2cab5ffbc 100644
> >>>> --- a/net/ipv4/bpf_tcp_ca.c
> >>>> +++ b/net/ipv4/bpf_tcp_ca.c
> >>>> @@ -144,6 +144,57 @@ static const struct bpf_func_proto bpf_tcp_send_ack_proto = {
> >>>>           .arg2_type      = ARG_ANYTHING,
> >>>>    };
> >>>>
> >>>> +BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> >>>> +          int, optname, char *, optval, int, optlen)
> >>>> +{
> >>>> +       struct bpf_tramp_run_ctx *run_ctx, *saved_run_ctx;
> >>>> +       int ret;
> >>>> +
> >>>> +       if (optname != TCP_CONGESTION)
> >>>> +               return _bpf_setsockopt(sk, level, optname, optval, optlen);
> >>>> +
> >>>> +       run_ctx = (struct bpf_tramp_run_ctx *)current->bpf_ctx;
> >>>> +       if (unlikely(run_ctx->saved_run_ctx &&
> >>>> +                    run_ctx->saved_run_ctx->type == BPF_RUN_CTX_TYPE_STRUCT_OPS)) {
> >>>> +               saved_run_ctx = (struct bpf_tramp_run_ctx *)run_ctx->saved_run_ctx;
> >>>> +               /* It stops this looping
> >>>> +                *
> >>>> +                * .init => bpf_setsockopt(tcp_cc) => .init =>
> >>>> +                * bpf_setsockopt(tcp_cc)" => .init => ....
> >>>> +                *
> >>>> +                * The second bpf_setsockopt(tcp_cc) is not allowed
> >>>> +                * in order to break the loop when both .init
> >>>> +                * are the same bpf prog.
> >>>> +                *
> >>>> +                * This applies even the second bpf_setsockopt(tcp_cc)
> >>>> +                * does not cause a loop.  This limits only the first
> >>>> +                * '.init' can call bpf_setsockopt(TCP_CONGESTION) to
> >>>> +                * pick a fallback cc (eg. peer does not support ECN)
> >>>> +                * and the second '.init' cannot fallback to
> >>>> +                * another cc.
> >>>> +                */
> >>>> +               if (saved_run_ctx->bpf_cookie == (uintptr_t)sk)
> >>>> +                       return -EBUSY;
> >>>> +       }
> >>>> +
> >>>> +       run_ctx->bpf_cookie = (uintptr_t)sk;
> >>>> +       ret = _bpf_setsockopt(sk, level, optname, optval, optlen);
> >>>> +       run_ctx->bpf_cookie = 0;
> >>>
> >>> Instead of adding 4 bytes for enum in patch 3
> >>> (which will be 8 bytes due to alignment)
> >>> and abusing bpf_cookie here
> >>> (which struct_ops bpf prog might eventually read and be surprised
> >>> to find sk pointer in there)
> >>> how about adding 'struct task_struct *saved_current' as another arg
> >>> to bpf_tramp_run_ctx ?
> >>> Always store the current task in there in prog_entry_struct_ops
> >>> and then compare it here in this specialized bpf_init_ops_setsockopt?
> >>>
> >>> Or maybe always check in enter_prog_struct_ops:
> >>> if (container_of(current->bpf_ctx, struct bpf_tramp_run_ctx,
> >>> run_ctx)->saved_current == current) // goto out since recursion?
> >>> it will prevent issues in case we don't know about and will
> >>> address the good recursion case as explained in patch 1?
> >>> I'm assuming 2nd ssthresh runs in a different task..
> >>> Or is it actually the same task?
> >>
> >> The 2nd ssthresh() should run in the same task but different sk.  The
> >> first ssthresh(sk[1]) was run in_task() context and then got
> >> interrupted.  The softirq then handles the rcv path which just happens
> >> to also call ssthresh(sk[2]) in the unlikely pkt-loss case. It is like
> >> ssthresh(sk[1]) => softirq => ssthresh(sk[2]).
> >>
> >> The tcp-cc ops can recur but cannot recur on the same sk because it
> >> requires to hold the sk lock, so the patch remembers what was the
> >> previous sk to ensure it does not recur on the same sk.  Then it needs
> >> to peek into the previous run ctx which may not always be
> >> bpf_trump_run_ctx.  eg. a cg bpf prog (with bpf_cg_run_ctx) can call
> >> bpf_setsockopt(TCP_CONGESTION, "a_bpf_tcp_cc") which then will call the
> >> a_bpf_tcp_cc->init().  It needs a bpf_run_ctx_type so it can safely peek
> >> the previous bpf_run_ctx.
> >
> > got it.
> >
> >>
> >> Since struct_ops is the only one that needs to peek into the previous
> >> run_ctx (through tramp_run_ctx->saved_run_ctx),  instead of adding 4
> >> bytes to the bpf_run_ctx, one idea just came to my mind is to use one
> >> bit in the tramp_run_ctx->saved_run_ctx pointer itsef.  Something like
> >> this if it reuses the bpf_cookie (probably missed some int/ptr type
> >> casting):
> >>
> >> #define BPF_RUN_CTX_STRUCT_OPS_BIT 1UL
> >>
> >> u64 notrace __bpf_prog_enter_struct_ops(struct bpf_prog *prog,
> >>                                       struct bpf_tramp_run_ctx *run_ctx)
> >>           __acquires(RCU)
> >> {
> >>          rcu_read_lock();
> >>          migrate_disable();
> >>
> >>          run_ctx->saved_run_ctx = bpf_set_run_ctx((&run_ctx->run_ctx) |
> >>                                          BPF_RUN_CTX_STRUCT_OPS_BIT);
> >>
> >>           return bpf_prog_start_time();
> >> }
> >>
> >> BPF_CALL_5(bpf_init_ops_setsockopt, struct sock *, sk, int, level,
> >>              int, optname, char *, optval, int, optlen)
> >> {
> >>          /* ... */
> >>          if (unlikely((run_ctx->saved_run_ctx &
> >>                          BPF_RUN_CTX_STRUCT_OPS_BIT) && ...) {
> >>                  /* ... */
> >>                  if (bpf_cookie == (uintptr_t)sk)
> >>                          return -EBUSY;
> >>          }
> >>
> >> }
> >
> > that should work, but don't you need to loop through all previous
> > run_ctx and check all with BPF_RUN_CTX_STRUCT_OPS_BIT type ?
> > Since run_ctx is saved in the task and we have preemptible
> > rpgos there could be tracing prog in the chain:
> > struct_ops_run_ctx->tracing_run_ctx->struct_ops_run_ctx
> > where 1st and last have the same 'sk'.
>
>
> This interleave of different run_ctx could happen.  My understanding is
> the 'struct_ops_run_ctx' can only be created when the tcp stack is
> calling the 'bpf_tcp_cc->init()' (or other cc ops).  In the above case,
> the first and second struct_ops_run_ctx are interleaved with a
> tracing_run_ctx.  Each of these two struct_ops_run_ctx was created from
> a different 'bpf_tcp_cc->init()' call by the kernel tcp stack.  They
> cannot be called with the same sk and changing that sk at the same time
> like this.  Otherwise, the kernel stack has a bug.

There could be also kprobe context in the chain, not necessarily
trampoline-based context. You want to look at previous struct_ops
run_ctx (if any), but it's not necessarily run_ctx->saved_run_ctx. It
could be one of the still earlier ones in the chain. And given kprobe
run_ctx doesn't have saved_run_ctx field and don't preserve the chain
of run_ctxs, there is no reliable way to check entire chain of
run_ctxs.

BPF_RUN_CTX_STRUCT_OPS_BIT is a bit dangerous if we ever do a similar
bit trick for some other type of run_ctx (which honestly we should
avoid). Enum would be safer, but still, you need to check the entire
chain of run_ctxs, which we do not preserve.

It seems to me that run_ctx is not the right mechanism to use here,
tbh. Are there any other alternatives?