On Tue, May 16, 2023 at 3:02 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Thu, May 11, 2023 at 10:21 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > When querying the bpf prog list, we don't really need to hold
> > cgroup_mutex. There is only one caller of cgroup_bpf_query
> > (cgroup_bpf_prog_query) and it does cgroup_get/put, so we're
> > safe WRT the cgroup going away. However, if we stop grabbing
> > cgroup_mutex, we risk racing with the prog attach/detach path
> > on the same cgroup, so here is how to work around it.
> >
> > We have two cases:
> > 1. querying the effective array
> > 2. querying the non-effective list
> >
> > (1) is easy because it's already an RCU-managed array, so all we
> > need is to make a copy of that array (under rcu read lock)
> > into a temporary buffer and copy that temporary buffer back
> > to userspace.
> >
> > (2) is more involved because we keep the list of progs and it's
> > not managed by RCU. However, it seems relatively simple to
> > convert that hlist to an RCU-managed one: convert the readers
> > to use hlist_xxx_rcu and replace kfree with kfree_rcu. One
> > other notable place is cgroup_bpf_release, where we replace
> > hlist_for_each_entry_safe with hlist_for_each_entry_rcu. This
> > should be safe because hlist_del_rcu does not remove/poison
> > the forward pointer of the list entry, so it's safe to remove
> > elements while iterating (without the specially flavored
> > for_each_safe wrapper).
> >
> > For (2), we also need to take care of flags. I added a bunch
> > of READ_ONCE/WRITE_ONCE to annotate lockless accesses. And I
> > also moved the flag update path to happen before adding the
> > prog to the list to make sure readers observe correct flags.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> > ---
> >  include/linux/bpf-cgroup-defs.h |   2 +-
> >  include/linux/bpf-cgroup.h      |   1 +
> >  kernel/bpf/cgroup.c             | 152 ++++++++++++++++++--------------
> >  3 files changed, 90 insertions(+), 65 deletions(-)
> >
>
> A few reservations from my side:
>
> 1. You are adding 16 bytes to bpf_prog_list, of which there could be
> *tons* of copies. It might be ok, but slightly speeding up something
> that's not even considered a performance-critical operation (prog
> query) at the expense of more memory usage feels a bit odd.
>
> 2. This code is already pretty tricky, and that's under the
> simplifying conditions of cgroup_mutex being held. We are now making
> it even more complicated without locks being held.
>
> 3. This code is probably going to be changed again once Daniel's
> multi-prog API lands, so are we going to do this exercise again
> afterwards?

I'm happy to wait for (3). From my pov, (2) is the most concerning.

The code is a bit complicated (and my patches are not helping); maybe
that's a sign that we need to clean it up :-) Some parts are rcu-safe,
some aren't. The cgroup_mutex usage looks like something that was done
long ago for simplicity and might not apply anymore. We now have
machines with multiple progs attached per cgroup; grabbing a global
lock just to query the list seems excessive.
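For reference, the query side of the conversion boils down to roughly
this (a sketch of the idea with made-up function/variable names, not
the literal patch):

static int query_prog_ids(struct hlist_head *progs,
			  u32 __user *uids, u32 cnt)
{
	struct bpf_prog_list *pl;
	u32 *ids, i = 0;
	int err = 0;

	ids = kcalloc(cnt, sizeof(*ids), GFP_KERNEL);
	if (!ids)
		return -ENOMEM;

	/* Readers only need rcu_read_lock(); attach/detach still
	 * serializes on cgroup_mutex. Snapshot the ids into a
	 * temporary buffer because copy_to_user() may fault/sleep
	 * and must not run inside the RCU read-side section.
	 */
	rcu_read_lock();
	hlist_for_each_entry_rcu(pl, progs, node) {
		/* pl->prog is NULL for link-attached programs */
		struct bpf_prog *prog = pl->prog ?: pl->link->link.prog;

		ids[i++] = prog->aux->id;
		if (i == cnt)
			break;
	}
	rcu_read_unlock();

	if (i && copy_to_user(uids, ids, i * sizeof(*ids)))
		err = -EFAULT;
	kfree(ids);
	return err;
}

The effective-array case (1) is the same pattern, except the snapshot
source is the rcu_dereference()'d effective array.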
> So taking a bit of a step back. In the cover letter you mentioned:
>
> > We're observing some stalls on the heavily loaded machines
> > in the cgroup_bpf_prog_query path. This is likely due to
> > being blocked on cgroup_mutex.
>
> Is that "likely" an unconfirmed suspicion, or did you see that the
> cgroup_mutex lock is causing stalls?

My intuition: we know that we have multiple-second stalls due to
cgroup_mutex elsewhere, and I don't see any other locks in the
prog_query path.

> Also, I wonder if you tried using BPF program or cgroup iterators to
> get all this information from the BPF side? I wonder if that's
> feasible?

The cgroup iterator has the following in a comment:

  Note: the iter_prog is called with cgroup_mutex held.

I can probably use a link iterator; I would have to upcast bpf_link to
bpf_cgroup_link (via bpf_probe_read?) to get to the cgroup id, but it
seems like a workaround?
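Something along these lines, maybe (untested sketch; it assumes
bpf_cgroup_link still embeds bpf_link as its first member, and uses
CO-RE reads instead of raw bpf_probe_read):

/* Untested sketch: walk all bpf_links, pick out the cgroup-attached
 * ones, and recover the cgroup id by "upcasting" to the containing
 * bpf_cgroup_link.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

#ifndef container_of
#define container_of(ptr, type, member) \
	((type *)((void *)(ptr) - offsetof(type, member)))
#endif

SEC("iter/bpf_link")
int dump_cgroup_links(struct bpf_iter__bpf_link *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct bpf_link *link = ctx->link;
	struct bpf_cgroup_link *cg_link;
	u64 cgrp_id;

	if (!link)
		return 0;

	if (BPF_CORE_READ(link, type) != BPF_LINK_TYPE_CGROUP)
		return 0;

	/* bpf_cgroup_link embeds bpf_link as its first member, so the
	 * container_of is effectively a cast.
	 */
	cg_link = container_of(link, struct bpf_cgroup_link, link);
	cgrp_id = BPF_CORE_READ(cg_link, cgroup, kn, id);

	BPF_SEQ_PRINTF(seq, "link_id=%u prog_id=%u cgroup_id=%llu\n",
		       BPF_CORE_READ(link, id),
		       BPF_CORE_READ(link, prog, aux, id),
		       cgrp_id);
	return 0;
}

Attaching is the usual iterator flow (bpf_program__attach_iter(), then
bpf_iter_create() on the link fd and read the output). But it's still
poking at kernel internals from the BPF side.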