On 1/28/25 13:39, Max Kellermann wrote:
> This eliminates several redundant atomic reads and therefore reduces
> the duration the surrounding spinlocks are held.
What architecture are you running? I don't get why the reads would be
expensive when they're relaxed and there shouldn't even be any
contention. It doesn't even need to be atomics; we should still be able
to convert the field back to plain ints.
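To make the point concrete, here is a minimal userspace sketch of the
pattern under discussion (hypothetical names, not the actual io-wq
code): load the atomic flags word once with a relaxed load and test the
cached copy, rather than issuing a separate atomic read per flag check.
Compilers typically won't merge repeated atomic loads, even relaxed
ones, so caching into a local (or dropping the atomic entirely) is what
removes the redundant reads.

/*
 * Hypothetical sketch, not the io-wq code itself: one relaxed load of
 * the atomic flags word, then plain tests on the cached local value.
 */
#include <stdatomic.h>
#include <stdio.h>

#define WORK_HASHED	(1u << 0)	/* illustrative flag bits */
#define WORK_CANCEL	(1u << 1)

struct work {
	atomic_uint flags;
};

static void handle_work(struct work *w)
{
	/* single relaxed read; later tests use the plain local copy */
	unsigned int flags = atomic_load_explicit(&w->flags,
						  memory_order_relaxed);

	if (flags & WORK_HASHED)
		puts("hashed");
	if (flags & WORK_CANCEL)
		puts("cancel");
}

int main(void)
{
	struct work w;

	atomic_init(&w.flags, WORK_HASHED | WORK_CANCEL);
	handle_work(&w);
	return 0;
}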
> In several io_uring benchmarks, this reduced the CPU time spent in
> queued_spin_lock_slowpath() considerably:
>
> io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
>
>     38.86%     -1.49%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      6.75%     +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
>      2.60%     +0.19%  [kernel.kallsyms]  [k] io_nop
>      3.92%     +0.18%  [kernel.kallsyms]  [k] io_req_task_complete
>      6.34%     -0.18%  [kernel.kallsyms]  [k] io_wq_submit_work
>
> HTTP server, static file:
>
>     42.79%     -2.77%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      2.08%     +0.23%  [kernel.kallsyms]  [k] io_wq_submit_work
>      1.19%     +0.20%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
>      1.46%     +0.15%  [kernel.kallsyms]  [k] ep_poll_callback
>      1.80%     +0.15%  [kernel.kallsyms]  [k] io_worker_handle_work
>
> HTTP server, PHP:
>
>     35.03%     -1.80%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
>      0.84%     +0.21%  [kernel.kallsyms]  [k] amd_iommu_iotlb_sync_map
>      1.39%     +0.12%  [kernel.kallsyms]  [k] _copy_to_iter
>      0.21%     +0.10%  [kernel.kallsyms]  [k] update_sd_lb_stats
>
> Signed-off-by: Max Kellermann <max.kellermann@xxxxxxxxx>
-- 
Pavel Begunkov