On 1/28/25 13:39, Max Kellermann wrote:
> This eliminates several redundant atomic reads and therefore reduces
> the duration the surrounding spinlocks are held.
What architecture are you running? I don't get why the reads would be
expensive when they're relaxed and there shouldn't even be any
contention. It doesn't even need to be atomics; we should still be able
to convert the field back to plain ints.
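To make the point concrete, here is a minimal userspace sketch of the
pattern under discussion (hypothetical names, not the actual io-wq
code): load the atomic flags word once with a relaxed load and test the
cached copy, rather than issuing a separate atomic read per flag check.
Compilers typically won't merge repeated atomic loads, even relaxed
ones, so caching into a local (or dropping the atomic entirely) is what
removes the redundant reads.

/*
 * Hypothetical sketch, not the io-wq code itself: one relaxed load of
 * the atomic flags word, then plain tests on the cached local value.
 */
#include <stdatomic.h>
#include <stdio.h>

#define WORK_HASHED	(1u << 0)	/* illustrative flag bits */
#define WORK_CANCEL	(1u << 1)

struct work {
	atomic_uint flags;
};

static void handle_work(struct work *w)
{
	/* single relaxed read; later tests use the plain local copy */
	unsigned int flags = atomic_load_explicit(&w->flags,
						  memory_order_relaxed);

	if (flags & WORK_HASHED)
		puts("hashed");
	if (flags & WORK_CANCEL)
		puts("cancel");
}

int main(void)
{
	struct work w;

	atomic_init(&w.flags, WORK_HASHED | WORK_CANCEL);
	handle_work(&w);
	return 0;
}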
> In several io_uring benchmarks, this reduced the CPU time spent in
> queued_spin_lock_slowpath() considerably:
>
> io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
>
>     38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
>      2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
>      3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
>      6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work
>
> HTTP server, static file:
>
>     42.79%     -2.77%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      2.08%     +0.23%  [kernel.kallsyms]  [k] io_wq_submit_work
>      1.19%     +0.20%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
>      1.46%     +0.15%  [kernel.kallsyms]  [k] ep_poll_callback
>      1.80%     +0.15%  [kernel.kallsyms]  [k] io_worker_handle_work
>
> HTTP server, PHP:
>
>     35.03%     -1.80%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      0.84%     +0.21%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
>      1.39%     +0.12%  [kernel.kallsyms]  [k] _copy_to_iter
>      0.21%     +0.10%  [kernel.kallsyms]  [k] update_sd_lb_stats
>
> Signed-off-by: Max Kellermann <max.kellermann@xxxxxxxxx>
-- 
Pavel Begunkov