Hello, (cc'ing Lai.)

On Mon, Mar 20, 2023 at 03:31:13PM -0700, Linus Torvalds wrote:
> On Mon, Mar 20, 2023 at 2:07 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> >
> > Nathan Huckleberry (1):
> >       fsverity: Remove WQ_UNBOUND from fsverity read workqueue
>
> There's a *lot* of other WQ_UNBOUND users. If it performs that badly,
> maybe there is something wrong with the workqueue code.
>
> Should people be warned to not use WQ_UNBOUND - or is there something
> very special about fsverity?
>
> Added Tejun to the cc. With one of the main documented reasons for
> WQ_UNBOUND being performance (both implicit "try to start execution of
> work items as soon as possible") and explicit ("CPU intensive
> workloads which can be better managed by the system scheduler"), maybe
> it's time to reconsider?
>
> WQ_UNBOUND adds a fair amount of complexity and special cases to the
> workqueues, and this is now the second "let's remove it because it's
> hurting things in a big way".

Do you remember what the other case was? Was it also on a heterogeneous arm
setup?

There aren't many differences between unbound workqueues and percpu ones
that aren't concurrency managed. If there are significant performance
differences, they're unlikely to come directly from whatever the workqueue
itself is doing.

One obvious thing that comes to mind is that WQ_UNBOUND may be pushing work
items across expensive cache boundaries (e.g. across cores that live on
separate L3 complexes). This isn't a totally new problem, and workqueue has
some topology awareness: by default, WQ_UNBOUND pools are segregated across
NUMA boundaries. That used to be good enough, but it's likely outmoded now,
given that non-trivial cache hierarchies on top of UMA or inside a single
node are common these days.

Looking at f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue"), I feel a bit uneasy. This would be fine on a setup which does
a moderate amount of IO on CPUs with quick enough acceleration mechanisms,
but that's not the whole world.
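For reference, the commit essentially boils down to dropping one flag from the
alloc_workqueue() call. A rough sketch, not the literal diff (the max_active
arguments shown are illustrative):

```c
/* Before: workers are unbound; the scheduler may place them on any
 * allowed CPU, potentially crossing L3 / NUMA cache boundaries. */
wq = alloc_workqueue("fsverity_read_queue",
		     WQ_UNBOUND | WQ_HIGHPRI, num_online_cpus());

/* After: a percpu workqueue; each work item runs on the CPU it was
 * queued from, which keeps the work cache-local but gives up the
 * scheduler's freedom to fan it out. */
wq = alloc_workqueue("fsverity_read_queue", WQ_HIGHPRI, 0);
```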
Use cases that generate extreme amounts of IO do depend on the ability to
fan out IO-related work items across multiple CPUs, especially when the IOs
coincide with network activity. So, my intuition is that the commit fixes a
subset of use cases while likely regressing others.

If the cache theory is correct, the right thing to do would be making the
workqueue init code a bit smarter so that it segments unbound pools on LLC
boundaries rather than NUMA ones, which would make more sense on recent AMD
chips too.

Nathan, can you run `hwloc-ls` on the affected setup (or `lstopo out.pdf`)
and attach the output?

As for the overhead of supporting WQ_UNBOUND: it does add a non-trivial
amount of complexity, but of the boring kind. It's all managerial stuff
which isn't too difficult to understand and is relatively easy to fix when
something goes wrong, so it isn't expensive in terms of supportability, and
it does address classes of significant use cases, so I think we should just
fix it.

Thanks.

--
tejun