Hi,

I have a couple of questions regarding the necessity of memory barriers when using io_uring, as outlined in https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but I still want to understand what is going on behind the scenes, so any comment would be appreciated.

Firstly, I wonder why memory barriers are required at all when NOT using polled mode. Requiring them in non-polled mode seems to imply that:
- Memory re-ordering occurs across system-call boundaries (i.e. when submitting, the tail write could happen after the io_uring_enter syscall?!)
- CPU data dependency checks do not work

So, are memory barriers really required when just using a simple loop around io_uring_enter with completely synchronous processing?

Secondly, the examples in io_uring.pdf suggest that checking completion entries requires a read_barrier and a write_barrier, and that submitting entries requires *two* write_barriers. Really? My expectation would be, just as with "normal" inter-thread userspace IPC, that plain store-release and load-acquire semantics are sufficient, e.g.:
- For reading completion entries:
-- first read the CQ ring head (without any ordering enforcement)
-- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
-- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
- For submitting entries:
-- first read the SQ ring tail (without any ordering enforcement)
-- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
-- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail

Wouldn't these be sufficient?!
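To make the second question concrete, here is a minimal sketch of the acquire/release scheme described above, assuming GCC/Clang __atomic builtins and POSIX threads. It is a userspace simulation only: a hypothetical `struct sim_ring` with helpers `ring_push`, `ring_pop` and `run_demo` (names made up for illustration), with a second thread standing in for the kernel. `ring_push` follows the SQ submission steps and `ring_pop` follows the CQ reaping steps. It does not model the real io_uring ring layout (e.g. the SQ index array) or what the kernel side actually does, so it shows the pattern I have in mind rather than proving it is sufficient for io_uring.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical single-producer/single-consumer ring standing in for an
 * io_uring SQ or CQ ring. NOT the real io_uring layout or ABI. */
#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct sim_ring {
    unsigned head;                      /* advanced only by the consumer */
    unsigned tail;                      /* advanced only by the producer */
    uint64_t entries[RING_ENTRIES];     /* stand-ins for SQEs/CQEs */
};

/* Producer side: follows the proposed SQ submission steps. */
static void ring_push(struct sim_ring *r, uint64_t value)
{
    /* Our own index: no ordering enforcement needed. */
    unsigned tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);
    /* Acquire on head: once we see the consumer's head update, its reads
     * of the slot we are about to overwrite have completed. */
    while (tail - __atomic_load_n(&r->head, __ATOMIC_ACQUIRE) == RING_ENTRIES)
        sched_yield();                  /* ring full */
    r->entries[tail & RING_MASK] = value;
    /* Release on tail: any thread that acquire-loads the new tail also
     * sees the entry written above. */
    __atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
}

/* Consumer side: follows the proposed CQ reaping steps.
 * Returns 1 and stores an entry in *out, or 0 if the ring is empty. */
static int ring_pop(struct sim_ring *r, uint64_t *out)
{
    /* Our own index: no ordering enforcement needed. */
    unsigned head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
    /* Acquire on tail: pairs with the producer's release store. */
    unsigned tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
    if (head == tail)
        return 0;
    *out = r->entries[head & RING_MASK];
    /* Release on head: the entry read above is ordered before the slot
     * is handed back to the producer. */
    __atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
    return 1;
}

struct producer_arg {
    struct sim_ring *ring;
    uint64_t count;
};

static void *producer_main(void *p)
{
    struct producer_arg *a = p;
    for (uint64_t i = 1; i <= a->count; i++)
        ring_push(a->ring, i);
    return NULL;
}

/* Pushes 1..count from a second thread (playing the "kernel") and consumes
 * them on the calling thread. Returns their sum, or UINT64_MAX if any entry
 * was observed out of order or with a stale value. */
static uint64_t run_demo(uint64_t count)
{
    struct sim_ring ring = {0};
    struct producer_arg arg = { &ring, count };
    pthread_t thread;
    int rc = pthread_create(&thread, NULL, producer_main, &arg);
    assert(rc == 0 && "pthread_create failed");
    (void)rc;

    uint64_t expected = 1, sum = 0, value;
    int in_order = 1;
    while (expected <= count) {
        if (!ring_pop(&ring, &value)) {
            sched_yield();              /* ring empty */
            continue;
        }
        if (value != expected)
            in_order = 0;               /* keep draining so the producer finishes */
        sum += value;
        expected++;
    }
    pthread_join(thread, NULL);
    return in_order ? sum : UINT64_MAX;
}
```

The sched_yield() calls only keep the spin loops from hogging a single CPU; they play no part in the memory ordering.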
Thirdly, io_uring.pdf and https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c seem a little contradictory, at least from my reading:

io_uring.pdf, in the completion entry example:
- Includes a read_barrier() **BEFORE** it reads the CQ ring tail
- Includes a write_barrier() **AFTER** updating the CQ head

io_uring.c says on completion entries:
- **AFTER** the application reads the CQ ring tail, it must use an appropriate smp_rmb() [...].
- It also needs a smp_mb() **BEFORE** updating CQ head [...].

io_uring.pdf, in the submission entry example:
- Includes a write_barrier() **BEFORE** updating the SQ tail
- Includes a write_barrier() **AFTER** updating the SQ tail

io_uring.c says on submission entries:
- [...] the application must use an appropriate smp_wmb() **BEFORE** writing the SQ tail (this matches io_uring.pdf)
- And it needs a barrier ordering the SQ head load before writing new SQ entries

I know io_uring.pdf does mention that the memory ordering description is simplified. So maybe this is the whole explanation for my confusion?

Cheers,
Johann