On Wed, Mar 22, 2023 at 6:04 PM Tejun Heo <tj@xxxxxxxxxx> wrote: > > Thanks for the pointers. They all seem plausible symptoms of work items > getting bounced across slow cache boundaries. I'm off for a few weeks so > can't really dig in right now but will get to it afterwards. So just as a gut feeling, I suspect that one solution would be to always *start* the work on the local CPU (where "local" might be the same, or at least a sibling). The only reason to migrate to another CPU would be if the work is CPU-intensive, and I do suspect that is commonly not really the case. And I strongly suspect that our WQ_CPU_INTENSIVE flag is pure garbage, and should just be gotten rid of, because what could be considered "CPU intensive" in under one situation might not be CPU intensive in another one, so trying to use some static knowledge about it is just pure guess-work. The different situations might be purely contextual things ("heavy network traffic when NAPI polling kicks in"), but it might also be purely hardware-related (ie "this is heavy if we don't have CPU hw acceleration for crypto, but cheap if we do"). So I really don't think it should be some static decision, either through WQ_CPU_INTENSIVE _or_ through "WQ_UNBOUND means schedule on first available CPU". Wouldn't it be much nicer if we just noticed it dynamically, and WQ_UNBOUND would mean that the workqueue _can_ be scheduled on another CPU if it ends up being advantageous? And we actually kind of have that dynamic flag already, in the form of the scheduler. It might even be explicit in the context of the workqueue (with "need_resched()" being true and the workqueue code itself might notice it and explicitly then try to spread it out), but with preemption it's more implicit and maybe it needs a bit of tweaking help. So that's what I mean by "start the work as local CPU work" - use that as the baseline decision (since it's going to be the case that has cache locality), and actively try to avoid spreading things out unless we have an explicit reason to, and that reason we could just get from the scheduler. The worker code already has that "wq_worker_sleeping()" callback from the scheduler, but that only triggers when a worker is going to sleep. I'm saying that the "scheduler decided to schedule out a worker" case might be used as a "Oh, this is CPU intensive, let's try to spread it out". See what I'm trying to say? And yes, the WQ_UNBOUND case does have a weak "prefer local CPU" in how it basically tends to try to pick the current CPU unless there is some active reason not to (ie the whole "wq_select_unbound_cpu()" code), but I suspect that is then counter-acted by the fact that it will always pick the workqueue pool by node - so the "current CPU" ends up probably being affected by what random CPU that pool was running on. An alternative to any scheduler interaction thing might be to just tweak "first_idle_worker()". I get the feeling that that choice is just horrid, and that is another area that could really try to take locality into account. insert_work() realyl seems to pick a random worker from the pool - which is fine when the pool is per-cpu, but when it's the unbound "random node" pool, I really suspect that it might be much better to try to pick a worker that is on the right cpu. But hey, I may be missing something. You know this code much better than I do. Linus