On Thu, May 11, 2023 at 10:21 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> When querying the bpf prog list, we don't really need to hold
> cgroup_mutex. There is only one caller of cgroup_bpf_query
> (cgroup_bpf_prog_query) and it does cgroup_get/put, so we're
> safe WRT the cgroup going away. However, if we stop grabbing
> cgroup_mutex, we risk racing with the prog attach/detach path
> to the same cgroup, so here is how to work around it.
>
> We have two cases:
> 1. querying the effective array
> 2. querying the non-effective list
>
> (1) is easy because it's already an RCU-managed array, so all we
> need is to make a copy of that array (under rcu read lock)
> into a temporary buffer and copy that temporary buffer back
> to userspace.
>
> (2) is more involved because we keep the list of progs and it's
> not managed by RCU. However, it seems relatively simple to
> convert that hlist to an RCU-managed one: convert the readers
> to use hlist_xxx_rcu and replace kfree with kfree_rcu. One
> other notable place is cgroup_bpf_release, where we replace
> hlist_for_each_entry_safe with hlist_for_each_entry_rcu. This
> should be safe because hlist_del_rcu does not remove/poison
> the forward pointer of the list entry, so it's safe to remove
> the elements while iterating (without a specially flavored
> for_each_safe wrapper).
>
> For (2), we also need to take care of flags. I added a bunch
> of READ_ONCE/WRITE_ONCE to annotate lockless access. And I
> also moved the flag update path to happen before adding the prog
> to the list to make sure readers observe correct flags.
>
> Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> ---
>  include/linux/bpf-cgroup-defs.h |   2 +-
>  include/linux/bpf-cgroup.h      |   1 +
>  kernel/bpf/cgroup.c             | 152 ++++++++++++++++++--------------
>  3 files changed, 90 insertions(+), 65 deletions(-)
>

A few reservations from my side:

1. You are adding 16 bytes to bpf_prog_list, of which there could be
   *tons* of copies.
   It might be OK, but slightly speeding up something that's not even
   considered a performance-critical operation (prog query) at the
   expense of more memory usage feels a bit odd.

2. This code is already pretty tricky, and that's under the simplifying
   condition of cgroup_mutex being held. We are now making it even more
   complicated without locks being held.

3. This code is probably going to be changed again once Daniel's
   multi-prog API lands, so are we going to do this exercise again
   afterwards?

So, taking a bit of a step back. In the cover letter you mentioned:

> We're observing some stalls on the heavily loaded machines
> in the cgroup_bpf_prog_query path. This is likely due to
> being blocked on cgroup_mutex.

Is that an unconfirmed suspicion, or did you actually see that the
cgroup_mutex lock is causing stalls? Also, I wonder if you tried using a
BPF program or cgroup iterators to get all this information from the BPF
side? I wonder if that's feasible?

[...]