Zhang Qiao <zhangqiao22@xxxxxxxxxx> writes:

> Hi, tejun
>
> Thanks for your reply.
>
> On 2022/6/27 16:32, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Jun 27, 2022 at 02:50:25PM +0800, Zhang Qiao wrote:
>>> Because the task cgroup's cpu.cfs_quota_us is very small and
>>> test_fork's load is very heavy, test_fork may be throttled for a
>>> long time; therefore the cgroup_threadgroup_rw_sem read lock is held
>>> for a long time and other processes get stuck waiting for the lock:
>>
>> Yeah, this is a known problem and can happen with other locks too. The
>> solution probably is to throttle only while in, or when about to return
>> to, userspace. There is one really important and widespread assumption
>> in the kernel:
>>
>> If things get blocked on some shared resource, whatever is holding
>> the resource ends up using more of the system to exit the critical
>> section faster and thus unblocks others ASAP. IOW, things running in
>> the kernel are work-conserving.
>>
>> The cpu bw controller gives userspace a rather easy way to break
>> this assumption and thus is rather fundamentally broken. This is
>> basically the same problem we had with the old cgroup freezer
>> implementation, which trapped threads in random locations in the
>> kernel.
>>
>
> So, if we want to completely solve this problem, is the best way to
> change the CFS bw controller's throttle mechanism? For example,
> throttle tasks at a safe location.

Yes, fixing (kernel) priority inversion due to CFS_BANDWIDTH requires a
serious reworking of how it works, because it would need to dequeue tasks
individually rather than the entire cfs_rq at a time (and would require
some effort to avoid pinging every throttling task to get it into the
kernel).
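
Very roughly, the shape of such a rework might look like the sketch below.
This is purely illustrative, not the existing fair.c code: throttle_pending,
the per-cfs_rq candidate list, and dequeue_task_and_wait_for_quota() are
made-up names, and locking/SMP details are omitted.

/*
 * Illustrative sketch only.
 * Idea: when quota runs out, don't dequeue the whole cfs_rq. Mark each
 * task instead and let it park itself on its own exit-to-user path,
 * where it cannot be holding kernel locks such as
 * cgroup_threadgroup_rw_sem.
 */

/* Bandwidth timer path, runtime exhausted (rq lock assumed held). */
static void throttle_cfs_rq_deferred(struct cfs_rq *cfs_rq)
{
	struct task_struct *p;

	cfs_rq->throttle_pending = true;	/* hypothetical flag */

	/*
	 * Hypothetical per-cfs_rq task list; kick each task toward its
	 * return-to-user path so it notices the pending throttle.
	 */
	list_for_each_entry(p, &cfs_rq->throttle_candidates, throttle_node)
		set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
}

/* Exit-to-user path of each task, e.g. from exit_to_user_mode_loop(). */
static void maybe_park_current_for_quota(void)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(current);

	if (!READ_ONCE(cfs_rq->throttle_pending))
		return;

	/*
	 * Safe point: current holds no kernel locks here, so blocking it
	 * cannot stall unrelated tasks on cgroup_threadgroup_rw_sem etc.
	 */
	dequeue_task_and_wait_for_quota(current);	/* hypothetical helper */
}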