On Wed, Jul 27, 2022 at 02:38:51PM -0700, Stanislav Fomichev wrote:
> On Wed, Jul 27, 2022 at 2:21 PM Martin KaFai Lau <kafai@xxxxxx> wrote:
> >
> > On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@xxxxxx> wrote:
> > > >
> > > > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@xxxxxxxxxx wrote:
> > > > > On 07/26, Martin KaFai Lau wrote:
> > > > > > Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
> > > > > > sock_setsockopt().  The number of supported options keeps
> > > > > > increasing, and so does the duplicated code.
> > > > > >
> > > > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > > > memory copy.  This patch adds an 'is_bpf' bit to tell if sk locking
> > > > > > has already been ensured by the bpf prog.
> > > > >
> > > > > Why not explicitly call it is_locked/is_unlocked?  I'm assuming, at some
> > > > > point,
> > > > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > > > the ns_capable() check, like in patch 3.  I ended up using
> > > > one is_bpf bit here to do both.
> > >
> > > Yeah, sorry, I hadn't read the whole series before I sent my first
> > > reply.  Let's discuss it here.
> > >
> > > This reminds me of ns_capable in __inet_bind where we also had to add
> > > special handling.
> > >
> > > In general, not specific to the series, I wonder if we want some new
> > > in_bpf() context indication and bypass ns_capable() from those
> > > contexts?  Then we can do things like:
> > >
> > > if (sk->sk_bound_dev_if && !in_bpf() && !ns_capable(net->user_ns,
> > >     CAP_NET_RAW))
> > >         return ...;
> >
> > Don't see a way to implement in_bpf() after some thought.
> > Do you have an idea?
>
> I wonder if we can cheat a bit with the following:
>
> bool setsockopt_capable(struct user_namespace *ns, int cap)
> {
>         if (!in_task()) {
>                 /* Running in irq/softirq -> setsockopt invoked by bpf program.
>                  * [not sure, is it safe to assume no regular path leads
>                  * to setsockopt from sirq?]
>                  */
>                 return true;
>         }
>
>         /* Running in process context, task has bpf_ctx set -> invoked
>          * by bpf program.
>          */
>         if (current->bpf_ctx != NULL)
>                 return true;
>
>         return ns_capable(ns, cap);
> }
>
> And then do s/ns_capable/setsockopt_capable/ in net/core/sock.c
>
> But that might be more fragile than passing the flag, idk.
I think it should work.  From a quick look, all bpf_setsockopt usage
has bpf_ctx.  The ones from bpf_tcp_ca (struct_ops) and bpf_iter go
through a trampoline, which also sets up bpf_ctx.  Not sure about
future use cases.

To be honest, I am not sure if I have missed cases, and I have similar
questions to the ones you raise in the sample code above.  This may
deserve a separate patch set for discussion.  Using a bit in sockptr
is mostly free now.  WDYT?
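To make the idea concrete, here is a minimal sketch of the sockptr bit
and how sock_setsockopt() could consume it.  The field placement and
the sock_setsockopt() body are illustrative only, not the actual diff:

typedef struct {
	union {
		void		*kernel;
		void __user	*user;
	};
	bool		is_kernel : 1;
	bool		is_bpf : 1;	/* sk lock already held by the bpf
					 * prog; ns_capable() not needed
					 */
} sockptr_t;

int sock_setsockopt(struct socket *sock, int level, int optname,
		    sockptr_t optval, unsigned int optlen)
{
	struct sock *sk = sock->sk;

	if (!optval.is_bpf)
		lock_sock(sk);
	/* ... option handling, with each ns_capable() check skipped
	 * when optval.is_bpf is set ...
	 */
	if (!optval.is_bpf)
		release_sock(sk);
	return 0;
}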
> > > Or would it make things more confusing?
> > >
> > > > > we can have code paths in bpf where the socket has been already locked by
> > > > > the stack?
> > > > hmm... You meant the opposite, like the bpf hook does not have the
> > > > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > > > should do lock_sock() as usual?
> > > >
> > > > I was thinking a likely situation is a bpf 'sleepable' hook that does
> > > > not have the lock pre-acquired.  In that case, the bpf_setsockopt()
> > > > could always acquire the lock first, but that may turn out to be too
> > > > pessimistic for the future bpf_[gs]etsockopt() refactoring.
> > > >
> > > > Or we could do this 'bit' break-up (one is_locked bit for locking
> > > > and one is_bpf bit to skip the capable check).  I was waiting until
> > > > a real need came up instead of having both bits always true now.
> > > > I don't mind adding is_locked now since bpf_lsm_cgroup may become
> > > > sleepable soon.  I can do this in the next spin.
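If it helps the discussion, a rough sketch of what that break-up could
look like.  The is_locked/is_bpf split and the bpf_sockptr() helper
name are made up here for illustration:

typedef struct {
	union {
		void		*kernel;
		void __user	*user;
	};
	bool		is_kernel : 1;
	bool		is_locked : 1;	/* caller already holds the sk lock */
	bool		is_bpf : 1;	/* skip the ns_capable() check */
} sockptr_t;

/* A sleepable hook without the lock pre-acquired would pass
 * locked = false and let sock_setsockopt() do lock_sock() as usual.
 */
static inline sockptr_t bpf_sockptr(void *p, bool locked)
{
	return (sockptr_t) { .kernel = p, .is_kernel = true,
			     .is_bpf = true, .is_locked = locked };
}

With both bits, the existing bpf_setsockopt() callers would pass
locked = true, so nothing changes until a sleepable user shows up.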