Hello,

On Wed, Apr 10, 2024 at 05:45:31PM +0900, Shin'ichiro Kawasaki wrote:
> Commit 5797b1c18919 ("workqueue: Implement system-wide nr_active
> enforcement for unbound workqueues") modified the maximum number of
> active works that an unbound workqueue can handle to at most
> WQ_DFL_MIN_ACTIVE (8 by default). This commit thus limits the number of

This shouldn't be the case. The default max_active remains the same at
256. MIN_ACTIVE is used only to guarantee minimum forward progress when
@max_active is too low in multi-NUMA setups. It's unexpected that the
commit caused a significant behavior difference on single NUMA machines.
The limits and enforcement for single NUMA machines shouldn't have
changed.

> active dm-zoned chunk works that execute concurrently on a single NUMA
> node machine. This reduction results in garbage collection performance
> degradation which manifests itself with longer unmount time with the xfs
> file system on dm-zoned devices.
>
> To restore unmount duration with dm-zoned devices, drop the WQ_UNBOUND
> flag for the chunk workqueue, thus allowing more than WQ_DFL_MIN_ACTIVE
> chunk works. Though this change bounds all chunk works to the same CPU,
> it provides more parallelism and improved performance. The table below
> shows the average xfs unmount time of 10 measurements, using a
> single NUMA node machine with 32 CPUs. The xfs volume was prepared on
> dm-zoned devices on top of an SMR HDD with a 26GB dm-linear clip, then
> filled with data files before executing unmount.
>
> Kernel              | Unmount time
> --------------------+--------------
> v6.8                | 29m 3s
> v6.9-rc2            | 34m 17s
> v6.9-rc2 + this fix | 27m 12s

Can you please run `drgn tools/workqueue/wq_monitor.py 'dmz_cwq.*'` while
testing? It should show how many work items are in flight and how much
CPU time the workqueues are consuming.

Thanks.

--
tejun
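
P.S. To make the max_active point concrete, here is a minimal sketch of an
alloc_workqueue() call, assuming a dm-zoned-style chunk workqueue. The
function name, the workqueue variable and the format string below are
illustrative placeholders, not the actual dm-zoned code:

	/*
	 * Illustrative sketch only -- not the exact dm-zoned source.
	 * "example_wq", "example_init" and the "example_cwq_%s" label
	 * are made-up names.
	 */
	#include <linux/errno.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *example_wq;

	static int example_init(void)
	{
		/*
		 * Passing 0 for max_active selects the default,
		 * WQ_DFL_ACTIVE (256). WQ_DFL_MIN_ACTIVE (8) is only the
		 * per-node floor guaranteed to unbound workqueues when a
		 * low max_active is distributed across NUMA nodes; it is
		 * not the concurrency cap.
		 */
		example_wq = alloc_workqueue("example_cwq_%s",
					     WQ_MEM_RECLAIM | WQ_UNBOUND,
					     0, "label");
		if (!example_wq)
			return -ENOMEM;

		/*
		 * The proposed fix drops WQ_UNBOUND, i.e. roughly:
		 *
		 *	example_wq = alloc_workqueue("example_cwq_%s",
		 *				     WQ_MEM_RECLAIM,
		 *				     0, "label");
		 *
		 * which makes the workqueue per-CPU and takes it out of
		 * the unbound nr_active distribution entirely.
		 */
		return 0;
	}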