Re: Memory ordering description in io_uring.pdf

On 9/18/22 10:56 AM, J. Hanne wrote:
> Hi,
> 
> I have a couple of questions regarding the necessity of including memory
> barriers when using io_uring, as outlined in
> https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but still I
> do want to understand what is going on behind the scenes, so any comment
> would be appreciated.

In terms of the barriers, that doc is somewhat outdated...

> Firstly, I wonder why memory barriers are required at all, when NOT using
> polled mode. Because requiring them in non-polled mode somehow implies that:
> - Memory re-ordering occurs across system-call boundaries (i.e. when
>   submitting, the tail write could happen after the io_uring_enter
>   syscall?!)
> - CPU data dependency checks do not work
> So, are memory barriers really required when just using a simple
> loop around io_uring_enter with completely synchronous processing?

No, I don't believe that they are. The exception is SQPOLL, as you mention,
as there's not necessarily a syscall involved with that.
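
For reference, the kind of fully synchronous loop you describe would, with
liburing, look roughly like the sketch below (a single NOP request, most
error handling skipped). liburing takes care of whatever ordering the rings
do need:

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

        /* queue a single NOP and submit it via io_uring_enter() */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);
        io_uring_submit(&ring);

        /* block until the completion arrives, then mark it seen */
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                printf("nop done, res=%d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
}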

> Secondly, the examples in io_uring.pdf suggest that checking completion
> entries requires a read_barrier and a write_barrier and submitting entries
> requires *two* write_barriers. Really?
> 
> My expectation would be, just as with "normal" inter-thread userspace IPC,
> that plain store-release and load-acquire semantics are sufficient, e.g.:
> - For reading completion entries:
> -- first read the CQ ring head (without any ordering enforcement)
> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
> -- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
> - For submitting entries:
> -- first read the SQ ring tail (without any ordering enforcement)
> -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
> -- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
> Wouldn't these be sufficient?!

Please check liburing to see what that does. I'd be interested in your
feedback (and patches!). x86 largely not caring too much about these has
meant that, I think, we've erred on the side of caution on that front.
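
Just so we're talking about the same thing, the acquire/release scheme you
describe for reaping CQEs would look roughly like the sketch below. Take it
as illustrative only: 'ring' is a struct io_uring * as set up by liburing,
the field names follow liburing's struct io_uring_cq, and handle_cqe() is
just a placeholder.

/*
 * Sketch only: reap completions with acquire/release instead of
 * explicit barriers. The application is the only writer of CQ head.
 */
unsigned head = *ring->cq.khead;
unsigned tail = __atomic_load_n(ring->cq.ktail, __ATOMIC_ACQUIRE);

while (head != tail) {
        struct io_uring_cqe *cqe = &ring->cq.cqes[head & *ring->cq.kring_mask];

        handle_cqe(cqe);        /* placeholder for the app's completion handling */
        head++;
}

/* the release pairs with the kernel reading head before reusing CQ slots */
__atomic_store_n(ring->cq.khead, head, __ATOMIC_RELEASE);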

> Thirdly, io_uring.pdf and
> https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c seem a
> little contradictory, at least from my reading:
> 
> io_uring.pdf, in the completion entry example:
> - Includes a read_barrier() **BEFORE** it reads the CQ ring tail
> - Includes a write_barrier() **AFTER** updating CQ head
> 
> io_uring.c says on completion entries:
> - **AFTER** the application reads the CQ ring tail, it must use an appropriate
>   smp_rmb() [...].
> - It also needs a smp_mb() **BEFORE** updating CQ head [...].
> 
> io_uring.pdf, in the submission entry example:
> - Includes a write_barrier() **BEFORE** updating the SQ tail
> - Includes a write_barrier() **AFTER** updating the SQ tail
> 
> io_uring.c says on submission entries:
> - [...] the application must use an appropriate smp_wmb() **BEFORE**
>   writing the SQ tail
>   (this matches io_uring.pdf)
> - And it needs a barrier ordering the SQ head load before writing new
>   SQ entries
>   
> I know, io_uring.pdf does mention that the memory ordering description
> is simplified. So maybe this is the whole explanation for my confusion?

The canonical resource at this point is the kernel code, as some of
the revamping of the memory ordering happened way later than when
that doc was written. Would be nice to get it updated at some point.
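
FWIW, what the io_uring.c comments boil down to on the submission side is a
single release store of the new SQ tail, roughly like the sketch below.
Again, the names are illustrative rather than the actual liburing internals,
the SQ index array indirection is left out, and fill_sqe() is a placeholder.

/*
 * Sketch only: queue one SQE with acquire/release. The SQE contents must
 * be visible before the kernel can observe the new tail, which the
 * release store guarantees; no separate smp_wmb() is needed.
 */
unsigned tail = *ring->sq.ktail;
unsigned head = __atomic_load_n(ring->sq.khead, __ATOMIC_ACQUIRE);

if (tail - head < *ring->sq.kring_entries) {
        struct io_uring_sqe *sqe = &ring->sq.sqes[tail & *ring->sq.kring_mask];

        fill_sqe(sqe);          /* placeholder for io_uring_prep_*() etc. */

        /* publish the SQE: ordered after the sqe writes by the release */
        __atomic_store_n(ring->sq.ktail, tail + 1, __ATOMIC_RELEASE);
}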

-- 
Jens Axboe