Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL

On 19/04/2022 20.14, Jens Axboe wrote:
> On 4/19/22 9:21 AM, Jens Axboe wrote:
>> On 4/19/22 6:31 AM, Jens Axboe wrote:
>>> On 4/19/22 6:21 AM, Avi Kivity wrote:
>>>> On 19/04/2022 15.04, Jens Axboe wrote:
>>>>> On 4/19/22 5:57 AM, Avi Kivity wrote:
>>>>>> On 19/04/2022 14.38, Jens Axboe wrote:
>>>>>>> On 4/19/22 5:07 AM, Avi Kivity wrote:
>>>>>>>> A simple webserver shows about a 5% loss compared to linux-aio.
>>>>>>>>
>>>>>>>> I expect the loss is due to an optimization that io_uring lacks -
>>>>>>>> inline completion vs. workqueue completion:
>>>>>>>
>>>>>>> I don't think that's it, io_uring never punts to a workqueue for
>>>>>>> completions.
>>>>>>
>>>>>> I measured this:
>>>>>>
>>>>>>   Performance counter stats for 'system wide':
>>>>>>
>>>>>>           1,273,756 io_uring:io_uring_task_add
>>>>>>
>>>>>>        12.288597765 seconds time elapsed
>>>>>>
>>>>>> Which exactly matches the number of requests sent. If that's the
>>>>>> wrong counter to measure, I'm happy to try again with the correct
>>>>>> counter.
>>>>>
>>>>> io_uring_task_add() isn't a workqueue, it's task_work. So that is
>>>>> expected.
>>>
>>> Might actually be implicated. Not because it's an async worker, but
>>> because I think we might be losing some affinity in this case. Looking
>>> at traces, we're definitely bouncing between the poll completion side
>>> and then executing the completion.
>>
>> Can you try this hack? It's against -git + for-5.19/io_uring. If you let
>> me know what base you prefer, I can do a version against that. I see
>> about a 3% win for io_uring with this, where before it was slower than
>> linux-aio, as you also saw.
>
> Another thing to try - get rid of the IPI for TWA_SIGNAL, which I
> believe may be the underlying cause of it.


Resurrecting an old thread. I have a question about timeliness of completions. Let's assume a request has completed. From the patch, it appears that io_uring will only guarantee that a completion appears on the completion ring if the thread has entered kernel mode since the completion happened. So user-space polling of the completion ring can cause unbounded delays.
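
To make that concrete, a minimal sketch of the polling pattern I mean
(liburing; handle_cqe() is illustrative, and the ring is assumed to
have been set up with io_uring_queue_init()):

	struct io_uring_cqe *cqe;

	for (;;) {
		/* io_uring_peek_cqe() essentially just reads the
		 * shared CQ ring memory and does not call
		 * io_uring_enter(), so task_work queued against this
		 * task never gets a chance to run, and a request that
		 * has completed internally may never surface here. */
		while (io_uring_peek_cqe(&ring, &cqe) == 0) {
			handle_cqe(cqe);
			io_uring_cqe_seen(&ring, cqe);
		}
		/* ... other application work, no syscalls ... */
	}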


If this is correct (it's not unreasonable, but it should be documented), then there should also be a simple way to force a kernel entry. But how does one do this using liburing? IIUC, if all of the following apply:


 1. I have no pending sqes,

 2. there are already pending completions in the completion queue ring, and

 3. there is a completed event for which a completion has not yet been appended to the completion queue ring,


then io_uring_wait_cqe() will elide io_uring_enter() and the completed-but-not-reported event will be delayed indefinitely.
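
If so, the workaround I would expect looks something like the sketch
below, assuming a liburing recent enough to have io_uring_get_events()
(which issues io_uring_enter() with IORING_ENTER_GETEVENTS and nothing
to submit):

	struct io_uring_cqe *cqe;

	if (io_uring_peek_cqe(&ring, &cqe) != 0) {
		/* Nothing visible in the CQ ring: force a kernel
		 * entry so pending task_work runs and any
		 * completed-but-unreported events are posted. */
		io_uring_get_events(&ring);
		io_uring_peek_cqe(&ring, &cqe);
	}

But it would be good to have it confirmed (and documented) that this
is the intended way to bound the reporting delay.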


> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 32aeb2c581c5..59987dd212d8 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -871,7 +871,7 @@ static bool io_wq_for_each_worker(struct io_wqe *wqe,
>  static bool io_wq_worker_wake(struct io_worker *worker, void *data)
>   {
> -	set_notify_signal(worker->task);
> +	set_notify_signal(worker->task, true);
>   	wake_up_process(worker->task);
>   	return false;
>   }
> @@ -991,7 +991,7 @@ static bool __io_wq_worker_cancel(struct io_worker *worker,
>   {
>   	if (work && match->fn(work, match->data)) {
>   		work->flags |= IO_WQ_WORK_CANCEL;
> -		set_notify_signal(worker->task);
> +		set_notify_signal(worker->task, true);
>   		return true;
>   	}
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 3c8b34876744..ac1f14973e09 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -359,10 +359,10 @@ static inline void clear_notify_signal(void)
>    * Called to break out of interruptible wait loops, and enter the
>    * exit_to_user_mode_loop().
>    */
> -static inline void set_notify_signal(struct task_struct *task)
> +static inline void set_notify_signal(struct task_struct *task, bool need_ipi)
>   {
>   	if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_SIGNAL) &&
> -	    !wake_up_state(task, TASK_INTERRUPTIBLE))
> +	    !wake_up_state(task, TASK_INTERRUPTIBLE) && need_ipi)
>   		kick_process(task);
>   }
> diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
> index 5d03a2ad1066..bff53f539933 100644
> --- a/kernel/livepatch/transition.c
> +++ b/kernel/livepatch/transition.c
> @@ -367,7 +367,7 @@ static void klp_send_signals(void)
>   			 * Send fake signal to all non-kthread tasks which are
>   			 * still not migrated.
>   			 */
> -			set_notify_signal(task);
> +			set_notify_signal(task, true);
>   		}
>   	}
>   	read_unlock(&tasklist_lock);
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index c59e1a49bc40..47d7024dc499 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -51,7 +51,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>   		set_notify_resume(task);
>   		break;
>   	case TWA_SIGNAL:
> -		set_notify_signal(task);
> +		set_notify_signal(task, false);
>   		break;
>   	default:
>   		WARN_ON_ONCE(1);
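
For reference, my (possibly wrong) reading of what the patch changes,
written as a sketch of a task_work producer - io_complete_cb and task
are illustrative names, not real io_uring identifiers:

	/* Runs in the target task's context on its next transition
	 * into the kernel; for io_uring this is where the CQE would
	 * be posted. */
	static void io_complete_cb(struct callback_head *cb)
	{
	}

	static struct callback_head work;

	init_task_work(&work, io_complete_cb);
	/* With the patch, TWA_SIGNAL maps to set_notify_signal(task,
	 * false): TIF_NOTIFY_SIGNAL is set and a sleeping task is
	 * woken, but a task busy in user space is no longer sent an
	 * IPI via kick_process() - it notices the work only on its
	 * next kernel entry, which is exactly the delay described
	 * above. */
	task_work_add(task, &work, TWA_SIGNAL);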



