Hi,

For v1, the replies to it, and tons of perf measurements, see:

https://lore.kernel.org/io-uring/3d553205-0fe2-482e-8d4c-a4a1ad278893@xxxxxxxxx/

v2 can be found here:

https://lore.kernel.org/io-uring/20240530152822.535791-2-axboe@xxxxxxxxx/

and v3 here:

https://lore.kernel.org/io-uring/20240605141933.11975-1-axboe@xxxxxxxxx/

The git tree can be found here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-msg-ring.1

and the silly test app being used here:

https://kernel.dk/msg-lat.c

Patches are based on top of the pending 6.11 io_uring changes.

tldr is that this series greatly improves the latency, overhead, and
throughput of sending messages to other rings. It does so by using the
existing io_uring task_work to pass messages, rather than the rather
big hammer of TWA_SIGNAL based generic kernel task_work. Note that this
differs from v3 of this posting, which used the CQE overflow approach.
While the CQE overflow approach still performs a bit better than this
one, this approach is a bit cleaner.
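For context, here's a minimal sketch of what message passing between
rings looks like from userspace, using liburing's
io_uring_prep_msg_ring() helper. This is not the msg-lat.c test app
linked above; the single-threaded setup, payload values, and lack of
error handling are illustrative only:

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring src, dst;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        /* error checking elided for brevity */
        io_uring_queue_init(8, &src, 0);
        io_uring_queue_init(8, &dst, 0);

        /*
         * Queue a message aimed at the destination ring's fd. The
         * target gets a CQE with res == 0x42 and user_data == 0xcafe.
         */
        sqe = io_uring_get_sqe(&src);
        io_uring_prep_msg_ring(sqe, dst.ring_fd, 0x42, 0xcafe, 0);
        io_uring_submit(&src);

        /* reap the sender side completion of the msg_ring SQE itself */
        io_uring_wait_cqe(&src, &cqe);
        io_uring_cqe_seen(&src, cqe);

        /* the target ring sees a CQE without having submitted anything */
        io_uring_wait_cqe(&dst, &cqe);
        printf("got msg: res=%d data=%llu\n", cqe->res,
               (unsigned long long) cqe->user_data);
        io_uring_cqe_seen(&dst, cqe);

        io_uring_queue_exit(&src);
        io_uring_queue_exit(&dst);
        return 0;
}

In the numbers below, sender and receiver run on separate CPUs (same
node for the local case, different nodes for the remote case). The
series only changes how the kernel delivers the message to the target
ring, so the userspace side is unchanged.

Performance for local (same node CPUs) message passing before this
change: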
init_flags=3000, delay=10 usec
latencies for: receiver (msg=82631)
    percentiles (nsec):
     |  1.0000th=[ 3088],  5.0000th=[ 3088], 10.0000th=[ 3120],
     | 20.0000th=[ 3248], 30.0000th=[ 3280], 40.0000th=[ 3312],
     | 50.0000th=[ 3408], 60.0000th=[ 3440], 70.0000th=[ 3472],
     | 80.0000th=[ 3504], 90.0000th=[ 3600], 95.0000th=[ 3696],
     | 99.0000th=[ 6368], 99.5000th=[ 6496], 99.9000th=[ 6880],
     | 99.9500th=[ 7008], 99.9900th=[12352]
latencies for: sender (msg=82631)
    percentiles (nsec):
     |  1.0000th=[ 5280],  5.0000th=[ 5280], 10.0000th=[ 5344],
     | 20.0000th=[ 5408], 30.0000th=[ 5472], 40.0000th=[ 5472],
     | 50.0000th=[ 5600], 60.0000th=[ 5600], 70.0000th=[ 5664],
     | 80.0000th=[ 5664], 90.0000th=[ 5792], 95.0000th=[ 5920],
     | 99.0000th=[ 8512], 99.5000th=[ 8640], 99.9000th=[ 8896],
     | 99.9500th=[ 9280], 99.9900th=[19840]

and after:

init_flags=3000, delay=10 usec
Latencies for: Sender (msg=236763)
    percentiles (nsec):
     |  1.0000th=[  225],  5.0000th=[  245], 10.0000th=[  278],
     | 20.0000th=[  294], 30.0000th=[  330], 40.0000th=[  378],
     | 50.0000th=[  418], 60.0000th=[  466], 70.0000th=[  524],
     | 80.0000th=[  604], 90.0000th=[  708], 95.0000th=[  804],
     | 99.0000th=[ 1864], 99.5000th=[ 2480], 99.9000th=[ 2768],
     | 99.9500th=[ 2864], 99.9900th=[ 3056]
Latencies for: Receiver (msg=236763)
    percentiles (nsec):
     |  1.0000th=[  764],  5.0000th=[  940], 10.0000th=[ 1096],
     | 20.0000th=[ 1416], 30.0000th=[ 1736], 40.0000th=[ 2040],
     | 50.0000th=[ 2352], 60.0000th=[ 2704], 70.0000th=[ 3152],
     | 80.0000th=[ 3856], 90.0000th=[ 4960], 95.0000th=[ 6176],
     | 99.0000th=[ 8032], 99.5000th=[ 8256], 99.9000th=[ 8768],
     | 99.9500th=[10304], 99.9900th=[91648]

and for remote (different node) CPUs, before:

init_flags=3000, delay=10 usec
Latencies for: Receiver (msg=44002)
    percentiles (nsec):
     |  1.0000th=[ 7264],  5.0000th=[ 8384], 10.0000th=[ 8512],
     | 20.0000th=[ 8640], 30.0000th=[ 8896], 40.0000th=[ 9024],
     | 50.0000th=[ 9152], 60.0000th=[ 9280], 70.0000th=[ 9408],
     | 80.0000th=[ 9536], 90.0000th=[ 9792], 95.0000th=[ 9920],
     | 99.0000th=[10304], 99.5000th=[13376], 99.9000th=[19840],
     | 99.9500th=[20608], 99.9900th=[25728]
Latencies for: Sender (msg=44002)
    percentiles (nsec):
     |  1.0000th=[11712],  5.0000th=[12864], 10.0000th=[12864],
     | 20.0000th=[13120], 30.0000th=[13248], 40.0000th=[13376],
     | 50.0000th=[13504], 60.0000th=[13760], 70.0000th=[13888],
     | 80.0000th=[14144], 90.0000th=[14272], 95.0000th=[14400],
     | 99.0000th=[15936], 99.5000th=[21632], 99.9000th=[24704],
     | 99.9500th=[25984], 99.9900th=[37632]

and after the changes:

init_flags=3000, delay=10 usec
Latencies for: Sender (msg=192598)
    percentiles (nsec):
     |  1.0000th=[  402],  5.0000th=[  430], 10.0000th=[  446],
     | 20.0000th=[  482], 30.0000th=[  700], 40.0000th=[  804],
     | 50.0000th=[  932], 60.0000th=[ 1176], 70.0000th=[ 1304],
     | 80.0000th=[ 1480], 90.0000th=[ 1752], 95.0000th=[ 2128],
     | 99.0000th=[ 2736], 99.5000th=[ 2928], 99.9000th=[ 4256],
     | 99.9500th=[ 8768], 99.9900th=[12864]
Latencies for: Receiver (msg=192598)
    percentiles (nsec):
     |  1.0000th=[ 2024],  5.0000th=[ 2544], 10.0000th=[ 2928],
     | 20.0000th=[ 3600], 30.0000th=[ 4048], 40.0000th=[ 4448],
     | 50.0000th=[ 4896], 60.0000th=[ 5408], 70.0000th=[ 5920],
     | 80.0000th=[ 6752], 90.0000th=[ 7904], 95.0000th=[ 9408],
     | 99.0000th=[10816], 99.5000th=[11712], 99.9000th=[16320],
     | 99.9500th=[18304], 99.9900th=[22656]

 include/linux/io_uring_types.h |   3 +
 io_uring/io_uring.c            |  53 ++++++++++++---
 io_uring/io_uring.h            |   3 +
 io_uring/msg_ring.c            | 119 ++++++++++++++++++++-------------
 io_uring/msg_ring.h            |   1 +
 5 files changed, 124 insertions(+), 55 deletions(-)

Since v3:
- Switch back to the task_work approach, rather than utilizing
  overflows for this
- Retain the old task_work approach for fd passing
- Various tweaks and cleanups

--
Jens Axboe