On 10/29/21 14:37, Xiaoguang Wang wrote:
hi,
On 10/29/21 03:57, Xiaoguang Wang wrote:
hi,
On 10/25/21 06:38, Xiaoguang Wang wrote:
Running an echo_server to evaluate io_uring's multi-shot poll performance,
perf shows that add_wait_queue() has obvious overhead. Introduce a new
state 'active' in io_poll_iocb to indicate whether io_poll_wake() should
queue a task_work. This new state is set to true initially, set to false
when starting to queue a task work, and set to true again once a poll
cqe has been committed. One concern is that this method may lose wakeup
events, but it seems to be ok:
      io_poll_wake                io_poll_task_func
t1                                |
t2                                |  WRITE_ONCE(req->poll.active, true);
t3                                |
t4                                |  io_commit_cqring(ctx);
t5                                |
t6                                |
If a wakeup event happens before or at t4, it's ok: the user app will
always see a cqe. If a wakeup event happens after t4, IIUC, io_poll_wake()
will see the new req->poll.active value by using READ_ONCE().
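
For illustration, here is a minimal sketch of the flow described above
(the function bodies and exact signatures are simplified assumptions for
clarity, not the literal patch; __io_async_wake() and the helpers are
abbreviated):

static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode,
			int sync, void *key)
{
	struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb,
						 wait);
	struct io_kiocb *req = wait->private;

	/* A task_work is already queued; it will set active back to
	 * true around committing the cqe, so this wakeup is covered. */
	if (!READ_ONCE(poll->active))
		return 0;
	WRITE_ONCE(poll->active, false);
	return __io_async_wake(req, poll, key_to_poll(key),
			       io_poll_task_func);
}

static void io_poll_task_func(struct io_kiocb *req)
{
	/* t2: re-arm first, so wakeups after t4 see active == true ... */
	WRITE_ONCE(req->poll.active, true);
	/* t4: ... while wakeups before/at t4 are covered by this cqe. */
	io_commit_cqring(req->ctx);
}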
The echo_server code can be cloned from:
https://codeup.openanolis.cn/codeup/storage/io_uring-echo-server.git,
branch xiaoguangwang/io_uring_multishot.
Without this patch, the tps in our test environment is 284116; with
this patch, the tps is 287832, about a 1.3% improvement, which is indeed
in accord with the saved add_wait_queue() cost.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@xxxxxxxxxxxxxxxxx>
---
 fs/io_uring.c | 57 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 33 insertions(+), 24 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 18af9bb9a4bc..e4c779dac953 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -481,6 +481,7 @@ struct io_poll_iocb {
 	__poll_t			events;
 	bool				done;
 	bool				canceled;
+	bool				active;
 	struct wait_queue_entry		wait;
 };
@@ -5233,8 +5234,6 @@ static inline int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *pol
 {
 	trace_io_uring_task_add(req->ctx, req->opcode, req->user_data, mask);

-	list_del_init(&poll->wait.entry);
-
As I mentioned to Hao some time ago, we can't allow this function, or in
particular io_req_task_work_add(), to happen twice before the first task
work gets executed: they use the same field in io_kiocb, and a second add
will corrupt the tw list.

Looks like that's what can happen here.
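
To illustrate the failure mode with a generic sketch (hypothetical
stand-in types, not the actual io_uring fields): adding the same
intrusive node twice before it is consumed corrupts the list:

struct tw_node {
	struct tw_node *next;
};

static struct tw_node *tw_head;

static void tw_add(struct tw_node *n)
{
	n->next = tw_head;
	tw_head = n;
}

/*
 * tw_add(n); tw_add(n);
 *
 * The second call executes n->next = tw_head with tw_head == n, so
 * n->next == n: the list now loops on itself, the old chain becomes
 * unreachable, and traversal never terminates.
 */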
If I have understood the scenario you described correctly, I think it
won't happen :)

With this patch, once the first io_req_task_work_add() is issued,
poll.active is set to false, so no new io_req_task_work_add() will be
issued. Only when the task_work installed by the first
io_req_task_work_add() has completed will poll.active be set to true
again.
Ah, I see now, the active dance is in io_poll_wake(). That won't work
with READ_ONCE/WRITE_ONCE, though; you would need real atomics.
The easiest race to explain is:

CPU1                          | CPU2
io_poll_wake                  | io_poll_wake
if (p->active) return 0;      | if (p->active) return 0;

it's "if (!p->active) return 0;" in my patch :)

hah, yeah, email draft-coding

// p->active is false in both cases, continue
p->active = false;            | p->active = false;
task_work_add()               | task_work_add()
io_poll_wake() is called with poll->head->lock held, so there will be no
concurrent io_poll_wake() calls.
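
(For completeness, a hypothetical sketch of the "real atomics" mentioned
above: if the wakers were ever not serialized by poll->head->lock, the
check-and-clear would have to be a single atomic step, e.g. with
poll->active turned into an atomic_t:

	if (!atomic_xchg(&poll->active, 0))
		return 0;	/* another waker already claimed this */
	/* exactly one waker reaches here and queues the task_work */

This is only an illustration of the race-free shape, not a proposed
change.)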
The problem is that the poll implementation is too wobbly; I can't count
how many corner cases there are...
Absolutely agree with you. In the process of developing the fixed poll
patch, I realized the poll implementation is not easy...
We can get to the point where your patches are "proven" to work, but
I'll be more on the cautious side, as the current state is
overcomplicated as hell.
OK, I understand your concerns. I'll check the double poll code again;
I'm not quite familiar with it yet.
Regards,
Xiaoguang Wang