Re: [PATCH bpf-next 4/4] bpf: query effective progs without cgroup_mutex

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Wed, 17 May 2023 15:25:40 -0700

On Tue, May 16, 2023 at 5:02 PM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
>
> > So taking a bit of a step back. In cover letter you mentioned:
> >
> >   > We're observing some stalls on the heavily loaded machines
> >   > in the cgroup_bpf_prog_query path. This is likely due to
> >   > being blocked on cgroup_mutex.
> >
> > Is that an unconfirmed suspicion, or did you actually see that
> > the cgroup_mutex lock is causing stalls?
>
> My intuition: we know that we have multiple-second stalls due to
> cgroup_mutex elsewhere and I don't see any other locks in the
> prog_query path.

I think more debugging is necessary here to root cause these multi-second stalls.
Sounds like they're real. We have to understand them and fix
the root issue.
"Let's make cgroup_bpf_query lockless, because we can, and hope that
it will help" is not data-driven development.

I can imagine that copy_to_user-s done from __cgroup_bpf_query
with cgroup_lock taken are causing delay, but multi-second ?!
There must be something else.
If copy_to_user is indeed the issue, we can move just that part
to be done after cgroup_unlock.
Note cgroup_bpf_attach/detach don't do user access, so shouldn't
be influenced by faults in user space.
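
To illustrate the "copy after unlock" idea, here is a rough user-space
sketch, not kernel code. Everything in it is a hypothetical stand-in:
a pthread mutex plays the role of cgroup_mutex, fake_prog_ids plays the
role of the effective prog list, and the final memcpy plays the role of
copy_to_user. The only point is the ordering: snapshot under the lock,
drop the lock, then do the copy that could fault.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROGS 64

/* Hypothetical stand-ins for cgroup_mutex and the state it protects. */
static pthread_mutex_t fake_cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned int fake_prog_ids[MAX_PROGS];
static size_t fake_prog_cnt;

/* Stand-in for attach: touches only shared state, never the caller's buffer. */
static int fake_attach(unsigned int id)
{
	int ret = -1;

	pthread_mutex_lock(&fake_cgroup_mutex);
	if (fake_prog_cnt < MAX_PROGS) {
		fake_prog_ids[fake_prog_cnt++] = id;
		ret = 0;
	}
	pthread_mutex_unlock(&fake_cgroup_mutex);
	return ret;
}

/* Stand-in for query: returns the number of ids written to dst. */
static size_t fake_query(unsigned int *dst, size_t dst_len)
{
	unsigned int snap[MAX_PROGS];
	size_t cnt;

	/* Short critical section: copy into a private snapshot only. */
	pthread_mutex_lock(&fake_cgroup_mutex);
	cnt = fake_prog_cnt;
	memcpy(snap, fake_prog_ids, cnt * sizeof(snap[0]));
	pthread_mutex_unlock(&fake_cgroup_mutex);

	/* The potentially slow copy (copy_to_user in the kernel) runs unlocked. */
	if (cnt > dst_len)
		cnt = dst_len;
	memcpy(dst, snap, cnt * sizeof(dst[0]));
	return cnt;
}
```

With this ordering a slow or faulting destination buffer delays only the
caller of fake_query, not anyone waiting in fake_attach. Whether that
accounts for multi-second stalls is exactly what still needs to be measured.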