Recently, commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") changed the behavior of WQ_UNBOUND workqueues. It changed the meaning of alloc_workqueue()'s max_active from an upper limit imposed per NUMA node to a limit imposed per CPU. As a result, a massive number of workers can be running at the same time, especially if the workqueue user assumes max_active is a global limit.

Admittedly, the documentation already described max_active as a per-CPU limit even before the commit. However, several callers seem to misuse max_active, perhaps assuming it is a global limit, so the commit is an unexpected behavior change for them. For example, the callers below set max_active = num_online_cpus(), which is a suspicious value once it is applied per CPU: this config allows nr_cpu * nr_cpu active works to run at the same time.

fs/f2fs/data.c:         sbi->post_read_wq = alloc_workqueue("f2fs_post_read_wq",
fs/f2fs/data.c-                                             WQ_UNBOUND | WQ_HIGHPRI,
fs/f2fs/data.c-                                             num_online_cpus());

fs/crypto/crypto.c:     fscrypt_read_workqueue = alloc_workqueue("fscrypt_read_queue",
fs/crypto/crypto.c-                                              WQ_UNBOUND | WQ_HIGHPRI,
fs/crypto/crypto.c-                                              num_online_cpus());

fs/verity/verify.c:     fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
fs/verity/verify.c-                                               WQ_HIGHPRI,
fs/verity/verify.c-                                               num_online_cpus());

drivers/crypto/hisilicon/qm.c:  qm->wq = alloc_workqueue("%s", WQ_HIGHPRI | WQ_MEM_RECLAIM |
drivers/crypto/hisilicon/qm.c-                           WQ_UNBOUND, num_online_cpus(),
drivers/crypto/hisilicon/qm.c-                           pci_name(qm->pdev));

block/blk-crypto-fallback.c:    blk_crypto_wq = alloc_workqueue("blk_crypto_wq",
block/blk-crypto-fallback.c-                                    WQ_UNBOUND | WQ_HIGHPRI |
block/blk-crypto-fallback.c-                                    WQ_MEM_RECLAIM, num_online_cpus());

drivers/md/dm-crypt.c:          cc->crypt_queue = alloc_workqueue("kcryptd/%s",
drivers/md/dm-crypt.c-                                            WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
drivers/md/dm-crypt.c-                                            num_online_cpus(), devname);

Furthermore, the change hurts performance in a certain case. Btrfs creates several WQ_UNBOUND workqueues with a default max_active = min(NRCPUS + 2, 8). As my machine has 96 CPUs with NUMA disabled, this max_active config now allows over 700 active works to run at the same time. Before the commit, the limit was 8 with NUMA disabled, or 16 with two NUMA nodes.

I reverted the workqueue code back to before the commit and ran the following fio command on a RAID0 btrfs filesystem on 6 SSDs.

fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
    --rw=write --fallocate=none \
    --direct=1 --ioengine=libaio --iodepth=32 \
    --filesize=100G \
    --blocksize=64k \
    --time_based --runtime=300s \
    --end_fsync=1 \
    --directory=${MNT} \
    --name=writer --numjobs=32

The result varies as the workqueue's max_active changes:

- wq max_active=8 (intended limit by btrfs?)
  WRITE: bw=2495MiB/s (2616MB/s), 2495MiB/s-2495MiB/s (2616MB/s-2616MB/s), io=753GiB (808GB), run=308953-308953msec

- wq max_active=16 (actual limit on a 2 NUMA node setup)
  WRITE: bw=1736MiB/s (1820MB/s), 1736MiB/s-1736MiB/s (1820MB/s-1820MB/s), io=670GiB (720GB), run=395532-395532msec

- wq max_active=768 (simulating the current limit)
  WRITE: bw=1276MiB/s (1338MB/s), 1276MiB/s-1276MiB/s (1338MB/s-1338MB/s), io=375GiB (403GB), run=300984-300984msec

The current limit is 27% slower than the previous limit (max_active=16) and 50% slower than the intended limit (max_active=8). The performance drop might be due to lock contention among the btrfs-endio-write works: over 700 kworker instances were created, and about 100 works were in the 'D' state competing for a lock.
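For reference, the three max_active values tested above (8, 16, and 768) come from multiplying btrfs's default max_active by the number of entities the limit is applied to. A back-of-the-envelope sketch of that arithmetic (a standalone illustration only, not kernel code; the real accounting lives in the pool_workqueue logic):

#include <stdio.h>

int main(void)
{
	int max_active = 8;   /* btrfs default: min(NRCPUS + 2, 8) */
	int nr_nodes   = 2;   /* NUMA nodes on the 2-node setup */
	int nr_cpus    = 96;  /* CPUs on my test machine */

	/* intended limit: max_active itself (NUMA disabled, old behavior) */
	printf("intended: %d\n", max_active);             /* 8 */

	/* old behavior with 2 NUMA nodes: per-node limit */
	printf("per-node: %d\n", max_active * nr_nodes);  /* 16 */

	/* new behavior after the commit: per-CPU limit */
	printf("per-CPU:  %d\n", max_active * nr_cpus);   /* 768 */

	return 0;
}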
More specifically, I tested the same workload at the commit and at the previous commit.

- At commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
  WRITE: bw=1191MiB/s (1249MB/s), 1191MiB/s-1191MiB/s (1249MB/s-1249MB/s), io=350GiB (376GB), run=300714-300714msec

- At the previous commit = 4cbfd3de73 ("workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug")
  WRITE: bw=1747MiB/s (1832MB/s), 1747MiB/s-1747MiB/s (1832MB/s-1832MB/s), io=748GiB (803GB), run=438134-438134msec

So, the commit causes a 31.8% performance drop.

In summary, several callers misuse max_active by treating it as a global limit, and the recent commit introduces a huge performance drop in some cases. We need to review alloc_workqueue() usage to check whether each max_active setting is appropriate.
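To make the pattern under review concrete, a hypothetical caller might look like the sketch below (names are made up; the flag combination and max_active value are borrowed from the f2fs/fscrypt examples above). The comment spells out what the value actually bounds under per-CPU max_active accounting:

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_read_wq;

static int example_init(void)
{
	/*
	 * Hypothetical caller, modeled on the fs/f2fs and fs/crypto
	 * examples quoted above. The intent was presumably "about one
	 * work per CPU in total", but because max_active is accounted
	 * per CPU, this allows up to num_online_cpus() works on each
	 * CPU, i.e. up to num_online_cpus() * num_online_cpus() active
	 * works system-wide.
	 */
	example_read_wq = alloc_workqueue("example_read_queue",
					  WQ_UNBOUND | WQ_HIGHPRI,
					  num_online_cpus());
	if (!example_read_wq)
		return -ENOMEM;

	return 0;
}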