I realize this probably isn't something you've looked at yet, but I was interested in a different criterion for evaluating io_uring: how efficient it is for a small number of requests that don't transfer much data. In other words, what is the minimum amount of io_uring work for which a program can see a speed-up? I realize this depends heavily on how much overlap can be gained from async processing.

To get a baseline, I wrote a test program that performs 4 opens, followed by 4 read + closes. For the baseline I intentionally used files in /proc so that there would be minimal async behavior and I could set IOSQE_ASYNC later. I was quite surprised by the result: almost the entire program wall time was spent in the io_uring_queue_exit() call.

I then wrote another test program that does nothing but inits followed by exits. There are clock_gettime() calls around the io_uring_queue_init(8, &ring, 0) and io_uring_queue_exit() calls, and I printed the ratio of the io_uring_queue_exit() elapsed time to the sum of the elapsed times of both calls. The result varied between 0.94 and 0.99, with an average of around 0.97. In other words, exit is roughly 16 to 100 times slower than init (0.94/0.06 is about 16, 0.99/0.01 is 99).

Looking at the liburing code, exit does just what I'd expect (unmap the pages and close the io_uring fd). I would have bet the ratio would be less than 0.50. No operations were ever performed on the ring, so there should be minimal cleanup. Even if the kernel needed to do a lot of cleanup, it shouldn't need the pages to be mapped into user space to do it; the same goes for the fd being open in the user process. It seems like there is some room for optimization here.
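
For anyone who wants to reproduce this, here is a minimal sketch of the init/exit timing test described above. It is not the original test program. The helper names (elapsed_ns, exit_share, share_to_slowdown) and the WITH_LIBURING build flag are made up for illustration, and CLOCK_MONOTONIC is an assumed clock choice; only io_uring_queue_init(8, &ring, 0), io_uring_queue_exit() and clock_gettime() come from the description.

```c
/* Sketch only. Build the measurement with:
 *   cc -O2 -DWITH_LIBURING timing.c -luring
 * Without -DWITH_LIBURING only the arithmetic helpers are compiled. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#ifdef WITH_LIBURING
#include <liburing.h>
#endif

/* Nanoseconds elapsed between two clock_gettime() samples. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* The ratio from the post: exit time / (init time + exit time). */
static double exit_share(int64_t init_ns, int64_t exit_ns)
{
    int64_t total = init_ns + exit_ns;
    return total > 0 ? (double)exit_ns / (double)total : 0.0;
}

/* Convert that ratio into "exit is N times slower than init". */
static double share_to_slowdown(double share)
{
    assert(share >= 0.0 && share < 1.0);
    return share / (1.0 - share);
}

#ifdef WITH_LIBURING
int main(void)
{
    struct io_uring ring;
    struct timespec i0, i1, e0, e1;

    /* Time ring setup: 8 entries, no flags, as in the post. */
    clock_gettime(CLOCK_MONOTONIC, &i0);
    int ret = io_uring_queue_init(8, &ring, 0);
    clock_gettime(CLOCK_MONOTONIC, &i1);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    /* Time teardown. No SQEs were ever submitted on this ring. */
    clock_gettime(CLOCK_MONOTONIC, &e0);
    io_uring_queue_exit(&ring);
    clock_gettime(CLOCK_MONOTONIC, &e1);

    int64_t init_ns = elapsed_ns(&i0, &i1);
    int64_t exit_ns = elapsed_ns(&e0, &e1);
    double share = exit_share(init_ns, exit_ns);

    printf("init %lld ns, exit %lld ns, exit share %.2f\n",
           (long long)init_ns, (long long)exit_ns, share);
    if (share < 1.0)
        printf("exit is about %.1fx slower than init\n",
               share_to_slowdown(share));
    return 0;
}
#endif
```

The conversion in share_to_slowdown() is where the "16 to 100 times" figure comes from: a share of 0.94 works out to about 15.7, 0.99 to about 99, and the average of 0.97 to about 32. A single run is noisy, so looping the measurement and looking at the spread is probably more useful than any one number.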