On Mon, Feb 21, 2022 at 12:55 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> It would be cool to have some numbers here.

Are there any numbers beyond what Suren mentioned that would be useful? As one example, in a trace of a camera workload that I opened at random to check for drain_local_pages stalls, I saw the kworker that ran drain_local_pages stay runnable for 68ms before getting any CPU time. I could query our trace corpus for more examples, but they're not hard to find in individual traces already.

> If the draining is too slow and dependent on the current CPU/WQ
> contention then we should address that. The original intention was that
> having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> operation from the rest of WQ activity. Maybe we need to fine tune
> mm_percpu_wq. If that doesn't help then we should revise the WQ model
> and use something else. Memory reclaim shouldn't really get stuck behind
> other unrelated work.

In my experience, workqueues are easy to misuse and should be approached with a lot of care. For many workloads they work fine 99%+ of the time, but once you run into scheduling delays on a workqueue, the only real option is to stop using workqueues.

If you have system-initiated work with minimal latency requirements (e.g., a driver heartbeat every so often, devfreq governors, things like that), workqueues are great. If you have userspace-initiated work that should respect priority (e.g., GPU command buffer submission in the critical path of UI) or latency-critical system-initiated work (e.g., display synchronization around panel refresh), workqueues are the wrong choice because there is no RT capability. WQ_HIGHPRI has a minor impact, but it won't solve the fundamental problem if the system is under heavy enough load or if RT threads are involved. As Petr mentioned, the best solution for those cases seems to be "convert the workqueue to an RT kthread_worker". I've done that many times on many different Android devices over the years for latency-critical work, especially around GPU, display, and camera.

In the drain_local_pages case, I think it is triggered by userspace work and should respect priority; I don't think a prio 50 RT task should be blocked waiting on a prio 120 (or prio 100 with WQ_HIGHPRI) kworker to be scheduled so it can run drain_local_pages. If that's a reasonable claim, then I think moving drain_local_pages away from workqueues is the best choice.
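
To make the kthread_worker suggestion concrete, here is roughly what the conversion could look like. This is just a sketch typed into the mail, not a tested patch: the names are hypothetical, CPU hotplug and error handling are mostly ignored, and the real change would live in mm/page_alloc.c rather than a standalone init.

#include <linux/cpu.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/kthread.h>
#include <linux/percpu.h>
#include <linux/sched.h>

/* Hypothetical names; one RT kthread_worker per CPU replacing mm_percpu_wq. */
static DEFINE_PER_CPU(struct kthread_worker *, pcp_drain_worker);
static DEFINE_PER_CPU(struct kthread_work, pcp_drain_work);

static void pcp_drain_fn(struct kthread_work *work)
{
	/* The worker is pinned to its CPU, so this drains that CPU's pcp lists. */
	drain_local_pages(NULL);
}

static int __init pcp_drain_workers_init(void)
{
	int cpu;

	/* Hotplug handling omitted for brevity. */
	for_each_online_cpu(cpu) {
		struct kthread_worker *w;

		w = kthread_create_worker_on_cpu(cpu, 0, "pcp_drain/%d", cpu);
		if (IS_ERR(w))
			return PTR_ERR(w);

		/* SCHED_FIFO so a busy CFS runqueue can't delay the drain. */
		sched_set_fifo(w->task);

		kthread_init_work(per_cpu_ptr(&pcp_drain_work, cpu), pcp_drain_fn);
		per_cpu(pcp_drain_worker, cpu) = w;
	}
	return 0;
}
core_initcall(pcp_drain_workers_init);

/* Caller side, roughly where queue_work_on(cpu, mm_percpu_wq, ...) is today. */
static void pcp_drain_cpu(int cpu)
{
	kthread_queue_work(per_cpu(pcp_drain_worker, cpu),
			   per_cpu_ptr(&pcp_drain_work, cpu));
}

On the caller side, kthread_flush_work() would stand in for the flush_work() loop that the synchronous drain path does today, so the behavior stays the same; the only intended difference is that the drain runs in an RT thread instead of a kworker.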