Re: io_uring memory ordering (and broken queue-is-full checks)

On 4/19/19 3:29 AM, Stefan Bühler wrote:
> Hi,
> 
> On 17.04.19 17:02, Jens Axboe wrote:
>> [...]
>>
>> Care to turn the pseudo code into a real patch? Would be easier to
>> digest. But thanks for taking a look at this!
> 
> Fair enough; I'm working on it.
> 
> I'll start with the actual bugs, and will try to clean up unneeded
> barriers/docs later.

That sounds great, thanks. The three patches look good to me, I've
queued them up.

>>> The "userspace" liburing looks even worse.  For starters,
>>> `io_uring_get_completion` cannot safely return a pointer to the CQ entry
>>> *AND* increment head: you need to actually read the entry before
>>> incrementing head.
>>
>> That's a good point, we should actually have this split in two to avoid
>> the case where the kernel will then fill in a new entry for us.
> 
> I see you already did so in the liburing repo.

Yep
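
The split being discussed — peek at the completion, let the caller read it,
and only then publish the new head — can be sketched as below. This is a
standalone single-producer/single-consumer model, not the real io_uring ABI
or the liburing API; the struct and helper names are illustrative only:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Minimal SPSC completion-ring sketch (hypothetical, not the io_uring
 * layout): the consumer must fully read the entry *before* publishing
 * the new head, or the producer may overwrite it underneath them. */
struct cqe { unsigned long long user_data; int res; };

#define RING_SZ 4u                   /* power of two */
struct cq_ring {
    _Atomic unsigned head;           /* advanced by consumer */
    _Atomic unsigned tail;           /* advanced by producer */
    struct cqe entries[RING_SZ];
};

/* Step 1: peek -- return a pointer without consuming the slot. */
static struct cqe *cq_peek(struct cq_ring *cq)
{
    unsigned head = atomic_load_explicit(&cq->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&cq->tail, memory_order_acquire);
    if (head == tail)
        return NULL;                 /* ring empty */
    return &cq->entries[head & (RING_SZ - 1)];
}

/* Step 2: only after the caller has read the entry, advance head.
 * The release store tells the producer the slot is free for reuse. */
static void cq_seen(struct cq_ring *cq)
{
    unsigned head = atomic_load_explicit(&cq->head, memory_order_relaxed);
    atomic_store_explicit(&cq->head, head + 1, memory_order_release);
}
```

The buggy single-call version effectively did both steps at once, so the
returned pointer could be overwritten before the caller dereferenced it.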

> When going through liburing code again, I noticed io_uring_submit
> assumes there might be pending submission entries in the SQ queue.
> 
> But those entries are not "reserved" in the SQE queue; so
> io_uring_get_sqe might overwrite SQE data that isn't read by the kernel
> through SQ yet.
> 
> Is there any motivation behind the indirection?  I see two possible ideas:
> - allocate fixed SQE entries for operations, and just add them to SQ
>   when needed
> - have multiple threads fill SQE in parallel, and only add them to SQ
>   when done
> 
> Are those actually worth the cost?
> 
> liburing doesn't look like it is going to take advantage of it, and the
> library I'm writing in Rust right now isn't going to either (actually
> even using a fixed sq_ndx == sqe_ndx mapping, i.e. it'll only write
> sq->array on startup).
> 
> At least consider documenting the motivation :)

I'll take a look at this case and see if the complexity is warranted,
or at least make sure that it's exposed in a safe fashion.
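
For readers unfamiliar with the indirection in question, here is a
simplified standalone model (not the actual uapi structs): the kernel
consumes slot indices from sq->array, and each index selects an entry in
the separate sqes[] array. A library that never fills SQEs out of order
can install the fixed sq_ndx == sqe_ndx mapping once at startup, as
Stefan describes:

```c
/* Simplified model of the SQ indirection layer (illustrative names,
 * not the real io_uring shared-ring layout). */
#define SQ_ENTRIES 8u                /* power of two */

struct sqe { int opcode; unsigned long long user_data; };

struct sq_ring {
    unsigned head, tail;             /* free-running ring indices */
    unsigned array[SQ_ENTRIES];      /* ring slot -> index into sqes[] */
    struct sqe sqes[SQ_ENTRIES];
};

/* Identity mapping: slot i always refers to sqes[i].  After this,
 * sq->array never needs to be written again. */
static void sq_init_identity(struct sq_ring *sq)
{
    for (unsigned i = 0; i < SQ_ENTRIES; i++)
        sq->array[i] = i;
}

/* General case: submit sqes[idx] by publishing its index at the tail.
 * Real code needs a release barrier before the tail update so the
 * kernel sees the SQE contents before it sees the new tail. */
static void sq_push(struct sq_ring *sq, unsigned idx)
{
    sq->array[sq->tail & (SQ_ENTRIES - 1)] = idx;
    sq->tail++;
}
```

The indirection only pays off for the two patterns Stefan lists
(preallocated per-operation SQEs, or multiple threads preparing SQEs
concurrently); with the identity mapping it is pure overhead.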

>>> Last but not least the kernel side has two lines it checks whether a
>>> queue is full or not and uses `tail + 1 == head`: this is only true if
>>> there were U32_MAX entries in the queue.  You should check `(tail -
>>> head) == SIZE` instead.
>>
>> Good catch, this is a leftover from when the rings were indeed capped by
>> the mask of the entries. I've fixed this one and pushed it out, and
>> added a liburing test case for it as well.
> 
> I think you missed the second one in io_uring_poll (looking at your
> for-linus branch), so I will send a patch for that too.

I did indeed, my grep failed me. Thanks for catching that!
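
The difference between the two checks is easy to demonstrate with
free-running unsigned indices: `tail + 1 == head` only fires once the
32-bit counters are U32_MAX apart, while `tail - head == size` detects a
full ring correctly, including across wrap-around, thanks to unsigned
modular arithmetic. A minimal sketch:

```c
/* Free-running unsigned head/tail over a ring of `size` slots. */

/* Broken check: only true when tail has lapped head by U32_MAX. */
static int ring_full_broken(unsigned head, unsigned tail)
{
    return tail + 1 == head;
}

/* Correct check: the number of in-flight entries equals the ring size.
 * Unsigned subtraction wraps, so this holds even after overflow. */
static int ring_full(unsigned head, unsigned tail, unsigned size)
{
    return tail - head == size;
}
```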

-- 
Jens Axboe



