On Mon, Mar 20, 2023 at 11:05 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Do you remember what the other case was? Was it also on heterogenous arm
> setup?

Yup. See commit c25da5b7baf1 ("dm verity: stop using WQ_UNBOUND for
verify_wq")

But see also 3fffb589b9a6 ("erofs: add per-cpu threads for decompression
as an option").

And you can see the confusion this all has in commit 43fa47cb116d ("dm
verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND"), which
perhaps should be undone now.

> There aren't many differences between unbound workqueues and percpu ones
> that aren't concurrency managed. If there are significant performance
> differences, it's unlikely to be directly from whatever workqueue is doing.

There's a *lot* of special cases for WQ_UNBOUND in the workqueue code,
and they are a lot less targeted than the other WQ_xyz flags, I feel.
They have their own cpumask logic, special freeing rules etc etc.

So I would say that the "aren't many differences" is not exactly true.
There are subtle and random differences, including the very basic
"queue_work()" workflow.

Now, I assume that the arm cases don't actually use
wq_unbound_cpumask, so I assume it's mostly the "instead of local cpu
queue, use the local node queue", and so it's all on random CPUs
since nobody uses NUMA nodes.

And no, if it's caching effects, doing it on LLC boundaries isn't
right *either*. By default it should probably be on L2 boundaries or
something, with most non-NUMA setups likely having one single LLC but
multiple L2 nodes.

               Linus
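
As a rough illustration of the last two points, here is a toy Python
model. It is not kernel code and does not reflect how the workqueue
implementation is actually structured. The 8-CPU topology (one NUMA
node, one shared LLC, four L2 clusters) and the names TOPOLOGY,
pools_for_scope and candidate_cpus are all invented for the example.
It only shows which CPUs a work item queued from a given CPU could
land on if work were grouped per CPU, per L2, per LLC, or per node.

```python
from collections import defaultdict

# Hypothetical topology: cpu -> (numa_node, llc_id, l2_id).
# One node, one LLC, four L2 clusters of two CPUs each (assumed values).
TOPOLOGY = {
    0: (0, 0, 0), 1: (0, 0, 0),
    2: (0, 0, 1), 3: (0, 0, 1),
    4: (0, 0, 2), 5: (0, 0, 2),
    6: (0, 0, 3), 7: (0, 0, 3),
}

# Which field of the topology tuple each grouping scope looks at.
SCOPES = {"node": 0, "llc": 1, "l2": 2}


def pools_for_scope(scope):
    """Group CPUs into pools that share the given topology level."""
    if scope == "cpu":
        return {cpu: [cpu] for cpu in sorted(TOPOLOGY)}
    idx = SCOPES[scope]
    pools = defaultdict(list)
    for cpu, ids in sorted(TOPOLOGY.items()):
        pools[ids[idx]].append(cpu)
    return dict(pools)


def candidate_cpus(queueing_cpu, scope):
    """CPUs a work item queued from queueing_cpu may run on in this model."""
    if scope == "cpu":
        return [queueing_cpu]
    key = TOPOLOGY[queueing_cpu][SCOPES[scope]]
    return pools_for_scope(scope)[key]


if __name__ == "__main__":
    for scope in ("cpu", "l2", "llc", "node"):
        print(scope, candidate_cpus(2, scope))
```

On this made-up single-node machine the "node" and "llc" groupings
both cover all eight CPUs, so neither keeps work near the CPU that
queued it, while the "l2" grouping narrows it to the two CPUs sharing
a cluster.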