On Thu, May 11, 2023 at 10:21 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> When querying the bpf prog list, we don't really need to hold
> cgroup_mutex. There is only one caller of cgroup_bpf_query
> (cgroup_bpf_prog_query) and it does cgroup_get/put, so we're
> safe WRT the cgroup going away. However, if we stop grabbing
> cgroup_mutex, we risk racing with the prog attach/detach path
> to the same cgroup, so here is how to work around it.
>
> We have two cases:
> 1. querying the effective array
> 2. querying the non-effective list
>
> (1) is easy because it's already an RCU-managed array, so all we
> need is to make a copy of that array (under rcu read lock)
> into a temporary buffer and copy that temporary buffer back
> to userspace.
>
> (2) is more involved because we keep the list of progs and it's
> not managed by RCU. However, it seems relatively simple to
> convert that hlist to an RCU-managed one: convert the readers
> to use hlist_xxx_rcu and replace kfree with kfree_rcu. One
> other notable place is cgroup_bpf_release, where we replace
> hlist_for_each_entry_safe with hlist_for_each_entry_rcu. This
> should be safe because hlist_del_rcu does not remove/poison
> the forward pointer of the list entry, so it's safe to remove
> the elements while iterating (without a specially flavored
> for_each_safe wrapper).
>
> For (2), we also need to take care of flags. I added a bunch
> of READ_ONCE/WRITE_ONCE to annotate lockless access. And I
> also moved the flag update path to happen before adding the prog
> to the list to make sure readers observe correct flags.
>
> Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> ---
>  include/linux/bpf-cgroup-defs.h |   2 +-
>  include/linux/bpf-cgroup.h      |   1 +
>  kernel/bpf/cgroup.c             | 152 ++++++++++++++++++--------------
>  3 files changed, 90 insertions(+), 65 deletions(-)
>

A few reservations from my side:

1. You are adding 16 bytes to bpf_prog_list, of which there could be
   *tons* of copies.
   It might be OK, but slightly speeding up something that's not even
   considered a performance-critical operation (prog query) at the
   expense of more memory usage feels a bit odd.

2. This code is already pretty tricky, and that's under the simplifying
   condition of cgroup_mutex being held. We are now making it even more
   complicated without locks being held.

3. This code is probably going to be changed again once Daniel's
   multi-prog API lands, so are we going to do this exercise again
   afterwards?

So, taking a bit of a step back. In the cover letter you mentioned:

> We're observing some stalls on the heavily loaded machines
> in the cgroup_bpf_prog_query path. This is likely due to
> being blocked on cgroup_mutex.

Is that an unconfirmed suspicion, or did you actually see that the
cgroup_mutex lock is causing stalls? Also, I wonder if you tried using a
BPF program or cgroup iterators to get all this information from the BPF
side? I wonder if that's feasible?

[...]