Re: [PATCH -v2] cgroup: fix deadlock caused by cgroup_mutex and cpu_hotplug_lock

chenridong <chenridong@xxxxxxxxxx> · Thu, 8 Aug 2024 10:22:21 +0800

On 2024/8/7 21:32, Michal Koutný wrote:
Hello.

On Sat, Jul 27, 2024 at 06:21:55PM GMT, chenridong <chenridong@xxxxxxxxxx> wrote:
Yes, I have offered the scripts in Link(V1).

Thanks (and thanks for patience).
There is no lockdep complaint about a deadlock (i.e. some circular
locking dependencies). (I admit the multiple holders of cgroup_mutex
reported there confuse me, I guess that's an artifact of this lockdep
report and they could be also waiters.)

Who'd be the holder of cgroup_mutex preventing cgroup_bpf_release from
progress? (That's not clear to me from your diagram.)

This is a cumulative process. The stress testing deletes a large number of
cgroups, and cgroup_bpf_release is asynchronous, competing with cgroup
release works.

Those are different situations:
- waiting for one holder that's stuck for some reason (that's what we're
   after),
- waiting because the mutex is contended (that's slow but progresses
   eventually).

You know, cgroup_mutex is used in many places. Eventually, the number of
`cgroup_bpf_release` instances in system_wq accumulates up to 256, and
that leads to this issue.

Reaching max_active doesn't mean that queue_work() would block or the
items were lost. They are only queued onto inactive_works list.

Yes, I agree. But what if the 256 active works can't finish because they
are all waiting for a lock? Then the works on the inactive list can never
be executed.
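
Both points above can be seen in a toy model of a queue with a max_active
limit. This is only an illustrative sketch, not kernel code: the class
ToyWorkqueue, its methods, the limit of 4 (standing in for 256) and the item
"other_work" are all hypothetical. Queueing never blocks and never drops an
item, but an inactive item is promoted only when an active one completes.

```python
from collections import deque


class ToyWorkqueue:
    """Toy model of a workqueue with a max_active limit (not kernel code)."""

    def __init__(self, max_active):
        self.max_active = max_active
        self.active = []         # items currently allowed to run
        self.inactive = deque()  # items queued beyond max_active

    def queue_work(self, item):
        # Queueing never blocks the caller and never drops the item.
        if len(self.active) < self.max_active:
            self.active.append(item)
        else:
            self.inactive.append(item)

    def complete(self, item):
        # Only a finished active item frees a slot for an inactive one.
        self.active.remove(item)
        if self.inactive:
            self.active.append(self.inactive.popleft())


wq = ToyWorkqueue(max_active=4)  # 4 stands in for the 256 of system_wq
for i in range(4):
    # Assume every one of these is stuck waiting for cgroup_mutex.
    wq.queue_work(f"cgroup_bpf_release_fn#{i}")
wq.queue_work("other_work")      # hypothetical item queued afterwards

# As long as no active item completes, "other_work" stays inactive.
print(wq.active)
print(list(wq.inactive))
```

If one of the stuck items could finish, complete() would promote
"other_work"; the question in this thread is whether any of them ever can.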
(Remark: cgroup_destroy_wq has only max_active=1 but it apparently
doesn't stop progress should there be more items queued (when
cgroup_mutex is not guarding losing references.))

cgroup_destroy_wq is not stopped by cgroup_mutex. Its worker has acquired
cgroup_mutex, but it is blocked on cpu_hotplug_lock.read, because
cpu_hotplug_lock.write is held by the cpu offline process (step 3).
---

The change on its own (deferred cgroup bpf progs removal via
cgroup_destroy_wq instead of system_wq) is sensible by collecting
related objects removal together (at the same time it shouldn't cause
problems by sharing one cgroup_destroy_wq).

But the reasoning in the commit message doesn't add up to me. There
isn't an obvious deadlock, I'd say that the system is overloaded with repeated
calls of __lockup_detector_reconfigure() and it is not in deadlock
state -- i.e. when you stop the test, it should eventually recover.
Given that, I wouldn't put Fixes: 4bfc0bb2c60e there either.

If I stop the test, it never recovers. It would not need to be fixed if it
could recover.
I have to admit, it is a complicated issue.

system_wq was not overloaded with __lockup_detector_reconfigure, but
with cgroup_bpf_release_fn. A large number of cgroups were deleted, so
all 256 active works in system_wq were cgroup_bpf_release_fn, and they
were all blocked on cgroup_mutex.

To make it simple, just imagine: what if max_active of system_wq were 1?
Could that result in a deadlock? If it could, just imagine that all the
works in system_wq are the same.
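
The max_active=1 thought experiment can be written down as a wait-for graph.
The sketch below is a toy model, not kernel code. The node names are
informal labels, and the edge from the cpu hotplug path to a work item on
system_wq is an assumption taken from the earlier discussion of
smp_call_on_cpu(), not something spelled out in this mail. find_cycle() is a
hypothetical helper that follows "waits for" edges until it either runs out
of edges or revisits a node.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    Each node waits for at most one other node, so following the edges
    from a start node either ends or comes back to a node already seen.
    """
    for start in waits_for:
        path, node = [], start
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):] + [node]
    return None


# system_wq with max_active=1: its single slot is taken by one
# cgroup_bpf_release_fn item that waits for cgroup_mutex.
waits_for = {
    "system_wq slot": "cgroup_bpf_release_fn",
    "cgroup_bpf_release_fn": "cgroup_mutex",
    "cgroup_mutex": "cgroup_destroy_wq worker",      # current holder
    "cgroup_destroy_wq worker": "cpu_hotplug_lock",
    "cpu_hotplug_lock": "cpu hotplug path",
    # Assumption: the hotplug side waits for a work item on system_wq.
    "cpu hotplug path": "work item on system_wq",
    "work item on system_wq": "system_wq slot",
}
cycle = find_cycle(waits_for)
print(cycle)

# With cgroup_bpf_release_fn moved off system_wq the slot is free, so the
# edge out of "system_wq slot" disappears and no cycle remains.
fixed = {k: v for k, v in waits_for.items() if k != "system_wq slot"}
print(find_cycle(fixed))
```

In this model the cycle exists only while the system_wq slot is occupied by
an item that itself waits for cgroup_mutex. With 256 slots the shape is the
same once every slot is taken by such an item.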

(One could symmetrically argue to move smp_call_on_cpu() away from
system_wq instead of cgroup_bpf_release_fn().)

I agree as well. That is why I move cgroup_bpf_release_fn away: cgroup has
its own queue. As TJ said, "system wqs are for misc things which shouldn't
create a large number of concurrent work items. If something is going to
generate 256+ concurrent work items, it should use its own workqueue."

Honestly, I'm not sure it's worth the effort if there's no deadlock.

There is a deadlock, and I think it has to be fixed.

It's possible that I'm misunderstanding or I've missed a substantial
detail for why this could lead to a deadlock. It'd be best visible in a
sequence diagram with tasks/CPUs left-to-right and time top-down (in the
original scheme it looks like time goes right-to-left and there's the
unclear situation of the initial cgroup_mutex holder).

Thanks,
Michal

I will modify the diagram.
And I hope you can understand how it leads to deadlock.
Thank you Michal for your reply.

Thanks,
Ridong