Hi,

For v1, the replies to it, and lots of perf measurements, go here:

https://lore.kernel.org/io-uring/3d553205-0fe2-482e-8d4c-a4a1ad278893@xxxxxxxxx/T/#m12f44c0a9ee40a59b0dcc226e22a0d031903aa73

and find v2 here:

https://lore.kernel.org/io-uring/20240530152822.535791-2-axboe@xxxxxxxxx/

and you can find the git tree here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-msg_ring

Patches are based on top of current Linus -git, with the 6.10 and 6.11
pending io_uring changes pulled in.

tldr is that this series greatly improves the latency, overhead, and
throughput of sending messages to other rings. It's done by using the
CQE overflow framework rather than attempting to lock remote rings,
which can potentially cause spurious -EAGAIN and io-wq usage. Outside
of that, it also unifies how message posting is done, ending up with a
single method across target ring types.

Some select performance results, with the sender using a 10 usec delay,
sending ~100K messages per second.

Pre-patches:

Latencies for: Sender (msg=131950)
    percentiles (nsec):
     |  1.0000th=[ 1896],  5.0000th=[ 2064], 10.0000th=[ 2096],
     | 20.0000th=[ 2192], 30.0000th=[ 2352], 40.0000th=[ 2480],
     | 50.0000th=[ 2544], 60.0000th=[ 2608], 70.0000th=[ 2896],
     | 80.0000th=[ 2992], 90.0000th=[ 3376], 95.0000th=[ 3472],
     | 99.0000th=[ 3568], 99.5000th=[ 3728], 99.9000th=[ 6880],
     | 99.9500th=[14656], 99.9900th=[42752]
Latencies for: Receiver (msg=131950)
    percentiles (nsec):
     |  1.0000th=[ 1160],  5.0000th=[ 1288], 10.0000th=[ 1336],
     | 20.0000th=[ 1384], 30.0000th=[ 1448], 40.0000th=[ 1624],
     | 50.0000th=[ 1688], 60.0000th=[ 1736], 70.0000th=[ 1768],
     | 80.0000th=[ 1848], 90.0000th=[ 2256], 95.0000th=[ 2320],
     | 99.0000th=[ 2416], 99.5000th=[ 2480], 99.9000th=[ 3184],
     | 99.9500th=[14400], 99.9900th=[18304]
Expected messages: 299882

and with the patches:

Latencies for: Sender (msg=247931)
    percentiles (nsec):
     |  1.0000th=[  181],  5.0000th=[  191], 10.0000th=[  201],
     | 20.0000th=[  211], 30.0000th=[  231], 40.0000th=[  262],
     | 50.0000th=[  290], 60.0000th=[  322], 70.0000th=[  390],
     | 80.0000th=[  482], 90.0000th=[  748], 95.0000th=[  892],
     | 99.0000th=[ 1032], 99.5000th=[ 1096], 99.9000th=[ 1336],
     | 99.9500th=[ 1512], 99.9900th=[ 1992]
Latencies for: Receiver (msg=247931)
    percentiles (nsec):
     |  1.0000th=[  350],  5.0000th=[  382], 10.0000th=[  410],
     | 20.0000th=[  482], 30.0000th=[  572], 40.0000th=[  652],
     | 50.0000th=[  764], 60.0000th=[  860], 70.0000th=[ 1080],
     | 80.0000th=[ 1480], 90.0000th=[ 1768], 95.0000th=[ 1896],
     | 99.0000th=[ 2448], 99.5000th=[ 2576], 99.9000th=[ 3184],
     | 99.9500th=[ 3792], 99.9900th=[17280]
Expected messages: 299926

which is a ~8.7x improvement in the sender's 50th percentile latency,
~3.5x in its 99th percentile, and a ~2.2x receiver side improvement at
the 50th percentile. Higher percentiles for the receiver are pretty
similar, but note that this is accomplished with the throughput being
almost twice that of before (~248K messages over 3 seconds vs ~132K
before).

Using a 20 usec message delay, targeting 50K messages per second, the
latency picture is close to the same as above. However, pre-patches we
get ~110K messages and with the patches we get ~142K messages.
Pre-patches is ~37% off the target rate; with the patches we're within
5% of the target.

One interesting use case for message passing is sending work items
between rings. For example, you can have one ring that accepts
connections and then passes them to worker threads, each with their own
ring. Or you can have threads that receive data and need to pass a work
item to another thread for processing. Normally that would be done with
some kind of serialized queue, plus a remote wakeup on the other end,
e.g. epoll watching an eventfd. That isn't very efficient. With message
passing, you can simply hand over the work item rather than needing to
manage both a queue and a wakeup mechanism in userspace.
 include/linux/io_uring_types.h |   8 ++
 io_uring/io_uring.c            |  33 ++---
 io_uring/io_uring.h            |  44 +++++++
 io_uring/msg_ring.c            | 211 +++++++++++++++++----------------
 io_uring/msg_ring.h            |   3 +
 5 files changed, 176 insertions(+), 123 deletions(-)

Changes since v2:
- Add wakeup batching for MSG_RING with DEFER_TASKRUN, by refactoring
  the helpers that we use for local task_work.
- Drop the patch splitting fd installing into a separate helper, as we
  just remove it at the end anyway when the old MSG_RING posting code
  is removed.
- Little cleanups

-- 
Jens Axboe