Hi,

what needs to be brought into a consistent state:

- https://kernel.dk/io_uring.pdf (where is the source??)
- https://git.kernel.dk/cgit/liburing/tree/man/io_uring.7
  (rendered at https://manpages.debian.org/testing/liburing-dev/io_uring.7.en.html)
- https://git.kernel.dk/cgit/liburing/tree/src/queue.c (using macros from
  https://git.kernel.dk/cgit/liburing/tree/src/include/liburing/barrier.h)
- https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c

I'll start with submission queue handling. Quoting myself, my (possibly
naive) approach for submitting entries would be, using gcc atomic builtins
in the absence of standardized C functions:

(1) first read the SQ ring tail (without any ordering enforcement)
(2) then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
(3) then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail

Comparing with the existing documentation, (3) matches everywhere:

- io_uring.pdf confirms (3), as it inserts a write_barrier() before
  updating the ring tail
- io_uring.7 confirms (3), as it uses atomic_store_release to update the
  ring tail
- __io_uring_flush_sq in queue.c confirms (3), as it uses
  io_uring_smp_store_release to update the ring tail
  (BUT __io_uring_flush_sq also special-cases IORING_SETUP_SQPOLL, which
  I do not fully understand)
- io_uring.c says: "the application must use an appropriate smp_wmb()
  before writing the SQ tail (ordering SQ entry stores with the tail
  store)"

However, (2) is not so clear:

- io_uring.pdf never reads the ring head (but at least it mentions that
  the example is simplified, as it is missing a queue-full check)
- io_uring.7 never reads the ring head (it does not check whether the
  ring is full, a condition it does not even mention)
- __io_uring_flush_sq in queue.c confirms that, usually, acquire
  semantics are needed for reading the ring head, but seems to handle it
  elsewhere due to how it works internally (?)
- io_uring.c says: "[the application] needs a barrier ordering the SQ
  head load before writing new SQ entries (smp_load_acquire to read head
  will do)."
  (BUT it does not mention WHY the application needs to load the ring
  head)

Lastly, I absolutely do not understand the second write_barrier in
io_uring.pdf after updating the ring tail.
https://git.kernel.dk/cgit/liburing/commit/?id=ecefd7958eb32602df07f12e9808598b2c2de84b
more or less just removed it. Before removal, it had this comment: "The
kernel has the matching read barrier for reading the SQ tail." Yes, the
kernel does need such a read barrier, but the write barrier *before* the
ring tail update should be enough?!

So, my recommendation for documentation updates is:

- In io_uring.pdf, remove the second write_barrier after the ring tail
  update
- In io_uring.pdf, augment the submission example with reading the ring
  head (to check for a queue-full condition), including a read_barrier
  afterwards
- In io_uring.7, also add a queue-full check
- In io_uring.c, extend the comment to say WHY the application needs to
  read the ring head

Comments?

Regards,
Johann

> On 25.09.2022, at 12:34, J. Hanne <io_uring@xxxxxxxxxxxxx> wrote:
> 
> Hi,
> 
>> On 22.09.2022, at 03:54, Jens Axboe <axboe@xxxxxxxxx> wrote:
>> 
>> On 9/18/22 10:56 AM, J. Hanne wrote:
>>> Hi,
>>> 
>>> I have a couple of questions regarding the necessity of including
>>> memory barriers when using io_uring, as outlined in
>>> https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but
>>> still I do want to understand what is going on behind the scenes, so
>>> any comment would be appreciated.
>> 
>> In terms of the barriers, that doc is somewhat outdated...
> 
> Ok, that pretty much explains why I got an inconsistent view after
> studying multiple sources…
> 
>> 
>>> Firstly, I wonder why memory barriers are required at all, when NOT
>>> using polled mode.
>>> Because requiring them in non-polled mode somehow implies that:
>>> - Memory re-ordering occurs across system-call boundaries (i.e. when
>>>   submitting, the tail write could happen after the io_uring_enter
>>>   syscall?!)
>>> - CPU data dependency checks do not work
>>> So, are memory barriers really required when just using a simple
>>> loop around io_uring_enter with completely synchronous processing?
>> 
>> No, I don't believe that they are. The exception is SQPOLL, as you
>> mention, as there's not necessarily a syscall involved with that.
>> 
>>> Secondly, the examples in io_uring.pdf suggest that checking
>>> completion entries requires a read_barrier and a write_barrier, and
>>> submitting entries requires *two* write_barriers. Really?
>>> 
>>> My expectation would be, just as with "normal" inter-thread userspace
>>> IPC, that plain store-release and load-acquire semantics are
>>> sufficient, e.g.:
>>> - For reading completion entries:
>>>   -- first read the CQ ring head (without any ordering enforcement)
>>>   -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
>>>   -- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
>>> - For submitting entries:
>>>   -- first read the SQ ring tail (without any ordering enforcement)
>>>   -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
>>>   -- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
>>> Wouldn't these be sufficient?!
>> 
>> Please check liburing to see what that does. Would be interested in
>> your feedback (and patches!). Largely, x86 not caring too much about
>> these has meant that I think we've erred on the side of caution on
>> that front.
> 
> Ok, I will check.
> My practical experience with memory barriers is limited, however, so
> I'm not in a position to give a final judgement.
> 
>> 
>>> Thirdly, io_uring.pdf and
>>> https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c
>>> seem a little contradictory, at least from my reading:
>>> 
>>> io_uring.pdf, in the completion entry example:
>>> - Includes a read_barrier() **BEFORE** it reads the CQ ring tail
>>> - Includes a write_barrier() **AFTER** updating CQ head
>>> 
>>> io_uring.c says on completion entries:
>>> - **AFTER** the application reads the CQ ring tail, it must use an
>>>   appropriate smp_rmb() [...].
>>> - It also needs a smp_mb() **BEFORE** updating CQ head [...].
>>> 
>>> io_uring.pdf, in the submission entry example:
>>> - Includes a write_barrier() **BEFORE** updating the SQ tail
>>> - Includes a write_barrier() **AFTER** updating the SQ tail
>>> 
>>> io_uring.c says on submission entries:
>>> - [...] the application must use an appropriate smp_wmb() **BEFORE**
>>>   writing the SQ tail
>>>   (this matches io_uring.pdf)
>>> - And it needs a barrier ordering the SQ head load before writing new
>>>   SQ entries
>>> 
>>> I know, io_uring.pdf does mention that the memory ordering description
>>> is simplified. So maybe this is the whole explanation for my confusion?
>> 
>> The canonical resource at this point is the kernel code, as some of
>> the revamping of the memory ordering happened way later than when
>> that doc was written. Would be nice to get it updated at some point.
> 
> Ok, I will try. Where is the io_uring.pdf source (tex? markdown?)?
> 
> Regards,
> Johann