On Tue, May 16, 2023 at 3:02 PM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Thu, May 11, 2023 at 10:21 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > When querying the bpf prog list, we don't really need to hold
> > cgroup_mutex. There is only one caller of cgroup_bpf_query
> > (cgroup_bpf_prog_query) and it does cgroup_get/put, so we're
> > safe WRT the cgroup going away. However, if we stop grabbing
> > cgroup_mutex, we risk racing with the prog attach/detach path
> > on the same cgroup, so here is how to work around it.
> >
> > We have two cases:
> > 1. querying the effective array
> > 2. querying the non-effective list
> >
> > (1) is easy because it's already an RCU-managed array, so all we
> > need is to make a copy of that array (under rcu read lock)
> > into a temporary buffer and copy that temporary buffer back
> > to userspace.
> >
> > (2) is more involved because we keep the list of progs and it's
> > not managed by RCU. However, it seems relatively simple to
> > convert that hlist to an RCU-managed one: convert the readers
> > to use hlist_xxx_rcu and replace kfree with kfree_rcu. One
> > other notable place is cgroup_bpf_release, where we replace
> > hlist_for_each_entry_safe with hlist_for_each_entry_rcu. This
> > should be safe because hlist_del_rcu does not remove/poison
> > the forward pointer of the list entry, so it's safe to remove
> > elements while iterating (without the specially flavored
> > for_each_safe wrapper).
> >
> > For (2), we also need to take care of flags. I added a bunch
> > of READ_ONCE/WRITE_ONCE to annotate lockless accesses. And I
> > also moved the flag update path to happen before adding the
> > prog to the list to make sure readers observe correct flags.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@xxxxxxxxxx>
> > ---
> >  include/linux/bpf-cgroup-defs.h |   2 +-
> >  include/linux/bpf-cgroup.h      |   1 +
> >  kernel/bpf/cgroup.c             | 152 ++++++++++++++++++--------------
> >  3 files changed, 90 insertions(+), 65 deletions(-)
> >
>
> A few reservations from my side:
>
> 1. You are adding 16 bytes to bpf_prog_list, of which there could be
> *tons* of copies. It might be ok, but slightly speeding up something
> that's not even considered a performance-critical operation (prog
> query) at the expense of more memory usage feels a bit odd.
>
> 2. This code is already pretty tricky, and that's under the
> simplifying conditions of cgroup_mutex being held. We are now making
> it even more complicated without locks being held.
>
> 3. This code is probably going to be changed again once Daniel's
> multi-prog API lands, so are we going to do this exercise again
> afterwards?

I'm happy to wait for (3). From my pov, (2) is the most concerning.

The code is a bit complicated (and my patches are not helping); maybe
that's a sign that we need to clean it up :-) Some parts are rcu-safe,
some aren't. The cgroup_mutex usage looks like something that was done
long ago for simplicity and might not apply anymore. We now have
machines with multiple progs attached per cgroup; grabbing a global
lock just to query the list seems excessive.
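For reference, the query side of the conversion boils down to roughly
this (a sketch of the idea with made-up function/variable names, not
the literal patch):

static int query_prog_ids(struct hlist_head *progs,
			  u32 __user *uids, u32 cnt)
{
	struct bpf_prog_list *pl;
	u32 *ids, i = 0;
	int err = 0;

	ids = kcalloc(cnt, sizeof(*ids), GFP_KERNEL);
	if (!ids)
		return -ENOMEM;

	/* Readers only need rcu_read_lock(); attach/detach still
	 * serializes on cgroup_mutex. Snapshot the ids into a
	 * temporary buffer because copy_to_user() may fault/sleep
	 * and must not run inside the RCU read-side section.
	 */
	rcu_read_lock();
	hlist_for_each_entry_rcu(pl, progs, node) {
		/* pl->prog is NULL for link-attached programs */
		struct bpf_prog *prog = pl->prog ?: pl->link->link.prog;

		ids[i++] = prog->aux->id;
		if (i == cnt)
			break;
	}
	rcu_read_unlock();

	if (i && copy_to_user(uids, ids, i * sizeof(*ids)))
		err = -EFAULT;
	kfree(ids);
	return err;
}

The effective-array case (1) is the same pattern, except the snapshot
source is the rcu_dereference()'d effective array.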
> So taking a bit of a step back. In the cover letter you mentioned:
>
> > We're observing some stalls on the heavily loaded machines
> > in the cgroup_bpf_prog_query path. This is likely due to
> > being blocked on cgroup_mutex.
>
> Is that "likely" an unconfirmed suspicion, or did you see that the
> cgroup_mutex lock is causing stalls?

My intuition: we know that we have multiple-second stalls due to
cgroup_mutex elsewhere, and I don't see any other locks in the
prog_query path.

> Also, I wonder if you tried using BPF program or cgroup iterators to
> get all this information from the BPF side? I wonder if that's
> feasible?

The cgroup iterator has the following in a comment:

  Note: the iter_prog is called with cgroup_mutex held.

I can probably use a link iterator; I would have to upcast bpf_link to
bpf_cgroup_link (via bpf_probe_read?) to get to the cgroup id, but it
seems like a workaround?
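Something along these lines, maybe (untested sketch; it assumes
bpf_cgroup_link still embeds bpf_link as its first member, and uses
CO-RE reads instead of raw bpf_probe_read):

/* Untested sketch: walk all bpf_links, pick out the cgroup-attached
 * ones, and recover the cgroup id by "upcasting" to the containing
 * bpf_cgroup_link.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

#ifndef container_of
#define container_of(ptr, type, member) \
	((type *)((void *)(ptr) - offsetof(type, member)))
#endif

SEC("iter/bpf_link")
int dump_cgroup_links(struct bpf_iter__bpf_link *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct bpf_link *link = ctx->link;
	struct bpf_cgroup_link *cg_link;
	u64 cgrp_id;

	if (!link)
		return 0;

	if (BPF_CORE_READ(link, type) != BPF_LINK_TYPE_CGROUP)
		return 0;

	/* bpf_cgroup_link embeds bpf_link as its first member, so the
	 * container_of is effectively a cast.
	 */
	cg_link = container_of(link, struct bpf_cgroup_link, link);
	cgrp_id = BPF_CORE_READ(cg_link, cgroup, kn, id);

	BPF_SEQ_PRINTF(seq, "link_id=%u prog_id=%u cgroup_id=%llu\n",
		       BPF_CORE_READ(link, id),
		       BPF_CORE_READ(link, prog, aux, id),
		       cgrp_id);
	return 0;
}

Attaching is the usual iterator flow (bpf_program__attach_iter(), then
bpf_iter_create() on the link fd and read the output). But it's still
poking at kernel internals from the BPF side.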