On Tue, May 16, 2023 at 5:02 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> > So taking a bit of a step back. In cover letter you mentioned:
> >
> > > We're observing some stalls on the heavily loaded machines
> > > in the cgroup_bpf_prog_query path. This is likely due to
> > > being blocked on cgroup_mutex.
> >
> > Is that likely an unconfirmed suspicion or you did see that
> > cgroup_mutex lock is causing stalls?
>
> My intuition: we know that we have multiple-second stalls due
> cgroup_mutex elsewhere and I don't see any other locks in the
> prog_query path.

I think more debugging is necessary here to root-cause these
multi-second stalls. Sounds like they're real. We have to understand
them and fix the root issue.
"Let's make cgroup_bpf_query lockless, because we can, and hope that
it will help" is not data-driven development.

I can imagine that the copy_to_user calls done from __cgroup_bpf_query
with cgroup_lock taken are causing delay, but multi-second?!
There must be something else.
If copy_to_user is indeed the issue, we can move just that part
to be done after cgroup_unlock.
Note cgroup_bpf_attach/detach don't do user access, so they shouldn't
be influenced by faults in user space.
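
To make the "copy after unlock" idea concrete, here is a rough userspace
sketch, not kernel code. A pthread mutex stands in for cgroup_mutex, and
struct prog_table, copy_out() and query_ids() are made-up names used only
for illustration. The point is just the ordering: snapshot under the lock,
drop the lock, then do the copy that might fault.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROGS 64

/* Hypothetical stand-in for a per-cgroup program list; not a kernel type. */
struct prog_table {
	pthread_mutex_t lock;	/* plays the role of cgroup_mutex */
	uint32_t ids[MAX_PROGS];
	size_t cnt;
};

/* Stand-in for copy_to_user(): in the kernel this may fault and sleep. */
static int copy_out(uint32_t *dst, const uint32_t *src, size_t n)
{
	memcpy(dst, src, n * sizeof(*src));
	return 0;
}

/*
 * Hold the lock only long enough to snapshot the ids into a local
 * buffer, then do the potentially slow copy with the lock dropped.
 * Returns the number of ids copied out.
 */
static size_t query_ids(struct prog_table *t, uint32_t *ubuf, size_t ucnt)
{
	uint32_t snap[MAX_PROGS];
	size_t n;

	pthread_mutex_lock(&t->lock);
	assert(t->cnt <= MAX_PROGS);
	n = t->cnt < ucnt ? t->cnt : ucnt;
	memcpy(snap, t->ids, n * sizeof(snap[0]));
	pthread_mutex_unlock(&t->lock);

	/* Lock is not held here, so a slow copy cannot block other lockers. */
	if (copy_out(ubuf, snap, n))
		return 0;
	return n;
}
```

The trade-off in this sketch is an extra copy into a temporary buffer, and
the caller sees a snapshot that may already be stale by the time it lands
in the destination buffer.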