On Wed, May 17, 2023 at 3:25 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, May 16, 2023 at 5:02 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > > So taking a bit of a step back. In the cover letter you mentioned:
> > >
> > > > We're observing some stalls on the heavily loaded machines
> > > > in the cgroup_bpf_prog_query path. This is likely due to
> > > > being blocked on cgroup_mutex.
> > >
> > > Is that an unconfirmed suspicion, or did you actually see that
> > > the cgroup_mutex lock is causing the stalls?
> >
> > My intuition: we know that we have multiple-second stalls due to
> > cgroup_mutex elsewhere, and I don't see any other locks in the
> > prog_query path.
>
> I think more debugging is necessary here to root cause these multi-second stalls.
> Sounds like they're real. We have to understand them and fix
> the root issue.
> "Let's make cgroup_bpf_query lockless, because we can, and hope that
> it will help" is not data-driven development.
>
> I can imagine that the copy_to_user-s done from __cgroup_bpf_query
> with cgroup_lock taken are causing delay, but multi-second ?!
> There must be something else.
> If copy_to_user is indeed the issue, we can move just that part
> to be done after cgroup_unlock.
> Note that cgroup_bpf_attach/detach don't do user access, so they shouldn't
> be influenced by faults in user space.

It's definitely not our path that is slow. Some other path that grabs
cgroup_mutex can hold it for an arbitrary amount of time, which makes our
bpf_query path wait on it. That's why I was hoping we could simply avoid
taking cgroup_mutex where possible (at least on the query/read paths).

Hao, can you share more about which particular path is causing the issue?
I don't see anything mentioned on the internal bug, but maybe you have
something to share?
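For completeness, here is a rough sketch of the "copy after unlock" idea
Alexei mentions above: snapshot the ids into a kernel buffer while holding
the lock, then do the user access only after cgroup_unlock(). This is
untested and purely illustrative; the helper name and the buffer handling
are made up and this is not the real __cgroup_bpf_query code.

#include <linux/cgroup.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

/*
 * Hypothetical sketch: keep copy_to_user() outside the cgroup_mutex
 * critical section so a page fault in user space can't extend the
 * time we hold the lock.
 */
static int query_prog_ids_sketch(struct cgroup *cgrp,
				 u32 __user *uids, u32 cnt)
{
	u32 *ids;
	int ret = 0;

	ids = kcalloc(cnt, sizeof(*ids), GFP_KERNEL);
	if (!ids)
		return -ENOMEM;

	cgroup_lock();
	/* fill ids[] from the cgroup's effective prog array here */
	cgroup_unlock();

	/* user access happens only after the lock is dropped */
	if (copy_to_user(uids, ids, cnt * sizeof(*ids)))
		ret = -EFAULT;

	kfree(ids);
	return ret;
}

That said, as noted above, this only shaves the copy_to_user() part off the
critical section; it doesn't help if some other cgroup_mutex holder is the
real source of the multi-second stalls.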