Re: io_uring question

Filipp Mikoian <Filipp.Mikoian@xxxxxxxxxxx> · Wed, 17 Jul 2019 14:56:17 +0000

> Can you try the attached patch and see if it fixes it for you?

Thank you very much, that worked like a charm for both O_DIRECT and the
page cache. Below is the output for O_DIRECT read submission on the same
machine:

root@localhost:~/io_uring# ./io_uring_read_blkdev /dev/sda8
submitted_already =   0, submitted_now =  32, submit_time =     277 us
submitted_already =  32, submitted_now =  32, submit_time =     131 us
submitted_already =  64, submitted_now =  32, submit_time =     213 us
submitted_already =  96, submitted_now =  32, submit_time =     170 us
submitted_already = 128, submitted_now =  32, submit_time =     161 us
submitted_already = 160, submitted_now =  32, submit_time =     169 us
submitted_already = 192, submitted_now =  32, submit_time =     184 us

> Not sure how best to convey that bit of information. If you're using
> the sq thread for submission, then we cannot reliably tell the
> application when an sqe has been consumed. The application must look for
> completions (successful or errors) in the CQ ring.

I know that SQPOLL support is not fully implemented in liburing, so for
now it seems that io_uring_get_sqe() could return an SQE that has not
actually been submitted yet, and editing it could lead to a race between
the kernel polling thread and user space. I just think this fact is
worth mentioning in the documentation.

> You could wait on cq ring completions, each sqe should trigger one.

Unfortunately, a few issues seem to arise if this approach is taken in an
IO-intensive application. As a disclaimer, I should note that SQ ring
overflow is a rare event given enough entries; nevertheless,
applications, especially those using SQPOLL, should handle this
situation gracefully and in a performant manner.

So what we have is a highly IO-intensive application that submits very
slow IOs*** (that's why it uses async IO in the first place) and cares
a lot about the progress of the submitting threads (the most probable
reason to use the SQPOLL feature). Given such prerequisites, the
following scenario is probable:

*** by 'very slow' I mean IOs whose completion takes significantly more
    time than their submission

1. Put @sq_entries with very slow IOs in SQ...
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
============>| X |     |   |     |   |   |
   +---+     +---+     +---+     +---+---+
   ...which will be submitted by polling thread
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
   |   |     |   |====>| X |     |   |   |
   +---+     +---+     +---+     +---+---+
2. Then try to add (@sq_entries + @pending) entries to SQ, but only
   succeed with @sq_entries.
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
==>| X |====>| X |     | X |     |   |   |
   +---+     +---+     +---+     +---+---+
3. Wait a very long time in io_uring_enter(GETEVENTS) for a CQ ring
   completion...
  PENDING      SQ     INFLIGHT       CQ
   +---+     +---+     +---+     +---+---+
   | X |     | X |     |   |====>| X |   |
   +---+     +---+     +---+     +---+---+
   ...and still there is no guarantee that a slot in the SQ ring has
   become available. Should we call
       io_uring_enter(GETEVENTS, min_complete = 1);
   in a loop, checking (*khead == *ktail) at every iteration?
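
The check at the end of step 3 can be sketched as plain arithmetic on
the ring indices. This is only an illustration, not liburing code:
struct sq_view, sq_space_left() and sq_is_empty() are hypothetical names,
and khead/ktail are assumed to point at the head and tail words of the
mmap()ed SQ ring. Since the polling thread advances the head
concurrently, the sketch reads it with an acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical view of the SQ ring indices. In a real program khead and
 * ktail would point into the mmap()ed ring; here they are plain pointers. */
struct sq_view {
    _Atomic unsigned *khead;   /* advanced by the consumer (kernel side) */
    _Atomic unsigned *ktail;   /* advanced by the producer (application) */
    unsigned entries;          /* ring size, assumed to be a power of two */
};

/* Number of free SQ slots. The indices are free-running unsigned
 * counters, so (tail - head) stays correct across wraparound. */
static unsigned sq_space_left(const struct sq_view *sq)
{
    assert(sq->entries && (sq->entries & (sq->entries - 1)) == 0);
    unsigned head = atomic_load_explicit(sq->khead, memory_order_acquire);
    unsigned tail = atomic_load_explicit(sq->ktail, memory_order_relaxed);
    return sq->entries - (tail - head);
}

/* Equivalent of the (*khead == *ktail) test: nothing left to consume. */
static bool sq_is_empty(const struct sq_view *sq)
{
    return sq_space_left(sq) == sq->entries;
}
```

Note that a loop around such a check is exactly the busy waiting that
the scenario above tries to avoid; the sketch only shows what would have
to be polled.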

In conclusion, it seems reasonable to instruct applications using SQPOLL
to submit SQEs until the queue is full, and then call io_uring_enter(),
probably with some flag, to wait for a slot in the submission queue
rather than for completions, since
1) The time needed to complete an IO tends to be much greater than the
   time needed to submit it.
2) A CQ ring completion does not imply that a slot became available in
   the SQ (see diagram above).
3) Busy waiting in the submitting thread is probably not what SQPOLL
   users want.

Side note: event-loop-driven applications might find it convenient to
epoll() the io_uring fd with EPOLLOUT to wait for an available entry in
the SQ. Do I understand correctly that spurious wakeups are currently
possible, since io_uring_poll() is woken only on io_commit_cqring(),
which, as shown above, doesn't guarantee that EPOLLOUT can be set?
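
For reference, the EPOLLOUT wait I have in mind could look roughly like
the sketch below. wait_writable() is a hypothetical helper, and fd is
assumed to be the descriptor returned by io_uring_setup(); the code
itself uses only generic epoll calls, so it works on any pollable fd.
Because of the possible spurious wakeups, a caller would still have to
re-check the SQ ring for free space after it returns.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until fd reports EPOLLOUT or timeout_ms elapses.
 * Returns 1 if EPOLLOUT was reported, 0 otherwise (timeout or a wakeup
 * without EPOLLOUT), and -1 on error. */
static int wait_writable(int fd, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLOUT };
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(ep);
        return -1;
    }

    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    if (n < 0)
        return -1;
    return (n == 1 && (out.events & EPOLLOUT)) ? 1 : 0;
}
```

A real event loop would keep one epoll instance around instead of
creating it per call; it is created here only to keep the example short.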

Thank you again!
__
Best regards,
Filipp Mikoian