(Apologies for the re-send... I neglected to turn off HTML, so linux-block bounced the email as spam.)

On Tue, Jul 18, 2023 at 11:51 AM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:

> But the long-standing dependency on VDO's work-queue data struct is still lingering (drivers/md/dm-vdo/work-queue.c). At a minimum we need to work toward pinning down _exactly_ why that is, and I think the best way to answer that is by simply converting the VDO code over to using Linux's workqueues. If doing so causes serious inherent performance (or functionality) loss then we need to understand why -- and fix Linux's workqueue code accordingly. (I've cc'd Tejun so he is aware).

We tried this experiment and did indeed see some significant performance differences: nearly a 7x slowdown in some cases.

VDO can be pretty CPU-intensive. In addition to hashing and compression, it scans some big in-memory data structures as part of the deduplication process. Some data structures are split across one or more "zones" to enable concurrency (usually split based on bits of an address or something like that), but some are not, and a couple of the threads handling those can sometimes exceed 50% CPU utilization, even 90%, depending on the system and test data configuration. (Usually this is while pushing over 1 GB/s through the deduplication and compression processing on a system with fast storage. On a slow VM with spinning storage, the CPU load is much smaller.)

We use a sort of message-passing arrangement where a worker thread is responsible for updating certain data structures as needed for the I/Os in progress, rather than having the processing of each I/O contend for locks on the data structures. It gives us some good throughput under load, but it does mean upwards of a dozen handoffs per 4 kB write, depending on compressibility, whether the block is a duplicate, and various other factors. So processing 1 GB/s means handling over 3M messages per second, though each step of processing is generally lightweight.
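
For anyone who wants to check the arithmetic behind that message rate, here is a rough sketch. It assumes 1 GB means 2^30 bytes, a 4 kB block means 4096 bytes, and exactly 12 handoffs per block (the real count varies with compressibility and deduplication). The function names are made up for illustration and do not appear in the VDO code.

```c
#include <stdint.h>

/* Number of fixed-size blocks written per second at a given throughput. */
uint64_t blocks_per_second(uint64_t bytes_per_second, uint64_t block_size)
{
	return bytes_per_second / block_size;
}

/*
 * Messages (thread-to-thread handoffs) per second, assuming every block
 * incurs the same number of handoffs. That uniform count is a
 * simplification; the real figure depends on the data being written.
 */
uint64_t messages_per_second(uint64_t bytes_per_second, uint64_t block_size,
			     uint64_t handoffs_per_block)
{
	return blocks_per_second(bytes_per_second, block_size) *
	       handoffs_per_block;
}
```

With those assumptions, messages_per_second(1ULL << 30, 4096, 12) comes to 262144 blocks/s times 12, or 3145728 messages per second, which is in line with the "over 3M" figure.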

For our dedicated worker threads, it's not unusual for a thread to wake up and process a few tens or even hundreds of updates to its data structures (likely benefiting from CPU caching of the data structures) before running out of available work and going back to sleep.

The experiment I ran was to create an ordered workqueue in place of each dedicated thread where we need serialization, and unordered workqueues where concurrency is allowed. On our slower test systems (> 10y old Supermicro Xeon E5-1650 v2, RAID-0 storage using SSDs or HDDs), the slowdown was less significant (under 2x), but on our faster system (4-5? year old Supermicro 1029P-WTR, 2x Xeon Gold 6128 = 12 cores, NVMe storage) we got nearly a 7x slowdown overall.

I haven't yet dug deeply into _why_ the kernel work queues are slower in this sort of setup. I did run "perf top" briefly during one test with kernel work queues, and the largest single use of CPU cycles was in spin lock acquisition, but I didn't get call graphs.

(This was with Fedora 37 6.2.12-200 and 6.2.15-200 kernels, without the latest submissions from Tejun, which look interesting. Though I suspect we care more about cache locality for some of our thread-specific data structures than for accessing the I/O structures.)

Ken