Hi,

what needs to be brought into a consistent state:

- https://kernel.dk/io_uring.pdf (where is the source??)
- https://git.kernel.dk/cgit/liburing/tree/man/io_uring.7
  (rendered at https://manpages.debian.org/testing/liburing-dev/io_uring.7.en.html)
- https://git.kernel.dk/cgit/liburing/tree/src/queue.c (using macros from
  https://git.kernel.dk/cgit/liburing/tree/src/include/liburing/barrier.h)
- https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c

I'll start with submission queue handling. Quoting myself, my (possibly
naive) approach for submitting entries would be, using gcc atomic builtins
in the absence of standardized C functions:

(1) first read the SQ ring tail (without any ordering enforcement)
(2) then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
(3) then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail

Comparing with the existing documentation, (3) matches everywhere:

- io_uring.pdf confirms (3), as it inserts a write_barrier() before
  updating the ring tail
- io_uring.7 confirms (3), as it uses atomic_store_release to update the
  ring tail
- __io_uring_flush_sq in queue.c confirms (3), as it uses
  io_uring_smp_store_release to update the ring tail
  (BUT __io_uring_flush_sq also special-cases IORING_SETUP_SQPOLL, which
  I do not fully understand)
- io_uring.c says: "the application must use an appropriate smp_wmb()
  before writing the SQ tail (ordering SQ entry stores with the tail
  store)"

However, (2) is not so clear:

- io_uring.pdf never reads the ring head (but at least it mentions that
  the example is simplified, as it is missing a queue-full check)
- io_uring.7 never reads the ring head (it does not check whether the
  ring is full, a condition it does not even mention)
- __io_uring_flush_sq in queue.c confirms that, usually, acquire
  semantics are needed for reading the ring head, but seems to handle it
  elsewhere due to how it works internally (?)
- io_uring.c says: "[the application] needs a barrier ordering the SQ
  head load before writing new SQ entries (smp_load_acquire to read head
  will do)."
  (BUT it does not mention WHY the application needs to load the ring
  head)

Lastly, I absolutely do not understand the second write_barrier in
io_uring.pdf after updating the ring tail.
https://git.kernel.dk/cgit/liburing/commit/?id=ecefd7958eb32602df07f12e9808598b2c2de84b
more or less just removed it. Before removal, it had this comment: "The
kernel has the matching read barrier for reading the SQ tail." Yes, the
kernel does need such a read barrier, but the write barrier *before* the
ring tail update should be enough?!

So, my recommendation for documentation updates is:

- In io_uring.pdf, remove the second write_barrier after the ring tail
  update
- In io_uring.pdf, augment the submission example with reading the ring
  head (to check for a queue-full condition), including a read_barrier
  afterwards
- In io_uring.7, also add a queue-full check
- In io_uring.c, extend the comment to say WHY the application needs to
  read the ring head

Comments?

Regards,
Johann

> On 25.09.2022, at 12:34, J. Hanne <io_uring@xxxxxxxxxxxxx> wrote:
> 
> Hi,
> 
>> On 22.09.2022, at 03:54, Jens Axboe <axboe@xxxxxxxxx> wrote:
>> 
>> On 9/18/22 10:56 AM, J. Hanne wrote:
>>> Hi,
>>> 
>>> I have a couple of questions regarding the necessity of including
>>> memory barriers when using io_uring, as outlined in
>>> https://kernel.dk/io_uring.pdf. I'm fine with using liburing, but
>>> still I do want to understand what is going on behind the scenes, so
>>> any comment would be appreciated.
>> 
>> In terms of the barriers, that doc is somewhat outdated...
> 
> Ok, that pretty much explains why I got an inconsistent view after
> studying multiple sources…
> 
>> 
>>> Firstly, I wonder why memory barriers are required at all, when NOT
>>> using polled mode.
>>> Because requiring them in non-polled mode somehow implies that:
>>> - Memory re-ordering occurs across system-call boundaries (i.e. when
>>>   submitting, the tail write could happen after the io_uring_enter
>>>   syscall?!)
>>> - CPU data dependency checks do not work
>>> So, are memory barriers really required when just using a simple
>>> loop around io_uring_enter with completely synchronous processing?
>> 
>> No, I don't believe that they are. The exception is SQPOLL, as you
>> mention, as there's not necessarily a syscall involved with that.
>> 
>>> Secondly, the examples in io_uring.pdf suggest that checking
>>> completion entries requires a read_barrier and a write_barrier, and
>>> submitting entries requires *two* write_barriers. Really?
>>> 
>>> My expectation would be, just as with "normal" inter-thread userspace
>>> IPC, that plain store-release and load-acquire semantics are
>>> sufficient, e.g.:
>>> - For reading completion entries:
>>>   -- first read the CQ ring head (without any ordering enforcement)
>>>   -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the CQ ring tail
>>>   -- then use __atomic_store(__ATOMIC_RELEASE) to update the CQ ring head
>>> - For submitting entries:
>>>   -- first read the SQ ring tail (without any ordering enforcement)
>>>   -- then use __atomic_load(__ATOMIC_ACQUIRE) to read the SQ ring head
>>>   -- then use __atomic_store(__ATOMIC_RELEASE) to update the SQ ring tail
>>> Wouldn't these be sufficient?!
>> 
>> Please check liburing to see what that does. Would be interested in
>> your feedback (and patches!). Largely, x86 not caring too much about
>> these has meant that I think we've erred on the side of caution on
>> that front.
> 
> Ok, I will check.
> My practical experience with memory barriers is limited, however, so
> I'm not in a position to give a final judgement.
> 
>> 
>>> Thirdly, io_uring.pdf and
>>> https://github.com/torvalds/linux/blob/master/io_uring/io_uring.c
>>> seem a little contradictory, at least from my reading:
>>> 
>>> io_uring.pdf, in the completion entry example:
>>> - Includes a read_barrier() **BEFORE** it reads the CQ ring tail
>>> - Includes a write_barrier() **AFTER** updating CQ head
>>> 
>>> io_uring.c says on completion entries:
>>> - **AFTER** the application reads the CQ ring tail, it must use an
>>>   appropriate smp_rmb() [...].
>>> - It also needs a smp_mb() **BEFORE** updating CQ head [...].
>>> 
>>> io_uring.pdf, in the submission entry example:
>>> - Includes a write_barrier() **BEFORE** updating the SQ tail
>>> - Includes a write_barrier() **AFTER** updating the SQ tail
>>> 
>>> io_uring.c says on submission entries:
>>> - [...] the application must use an appropriate smp_wmb() **BEFORE**
>>>   writing the SQ tail
>>>   (this matches io_uring.pdf)
>>> - And it needs a barrier ordering the SQ head load before writing new
>>>   SQ entries
>>> 
>>> I know, io_uring.pdf does mention that the memory ordering description
>>> is simplified. So maybe this is the whole explanation for my confusion?
>> 
>> The canonical resource at this point is the kernel code, as some of
>> the revamping of the memory ordering happened way later than when
>> that doc was written. Would be nice to get it updated at some point.
> 
> Ok, I will try. Where is the io_uring.pdf source (tex? markdown?)?
> 
> Regards,
> Johann