Re: [PATCH 2/2] io_uring: don't hold uring_lock when calling io_run_task_work*


 



On 04/02/2021 23:26, Pavel Begunkov wrote:


On 04/02/2021 11:17, Pavel Begunkov wrote:
On 04/02/2021 03:25, Hao Xu wrote:
On 04/02/2021 00:45, Pavel Begunkov wrote:
On 03/02/2021 16:35, Pavel Begunkov wrote:
On 03/02/2021 14:57, Hao Xu wrote:
This is caused by calling io_run_task_work_sig() to do work under
uring_lock while the caller io_sqe_files_unregister() already holds
uring_lock.
We need to check whether uring_lock is held by us when unlocking
around io_run_task_work_sig(), since there are code paths that reach
that place without uring_lock held.

1. We don't want to allow parallel io_sqe_files_unregister()s to
happen; it's synchronised by uring_lock at the moment. Otherwise it's
buggy.
Here "since there are code paths down to that place without uring_lock held" I mean code path of io_ring_ctx_free().

I guess this is about the 1/2, but let me outline the problem again:
if you have two userspace threads sharing a ring, then they can both,
in parallel, call the files_unregister syscall. That's a potential
double percpu_ref_kill(&data->refs), or even worse.

Same for 2, but racing for the table and refs.
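For illustration, a minimal userspace sketch of that scenario
(hypothetical reproducer, plain liburing API, not from this thread):
two threads sharing one ring both call the files unregister path at
the same time.

#include <liburing.h>
#include <pthread.h>

static struct io_uring ring;

static void *unregister_worker(void *arg)
{
	/* both threads race into io_sqe_files_unregister() in the kernel */
	io_uring_unregister_files(&ring);
	return NULL;
}

int main(void)
{
	int fds[2] = { 0, 1 };
	pthread_t t1, t2;

	io_uring_queue_init(8, &ring, 0);
	io_uring_register_files(&ring, fds, 2);

	/*
	 * Without uring_lock serialising them, both workers could reach
	 * percpu_ref_kill(&data->refs) for the same file table.
	 */
	pthread_create(&t1, NULL, unregister_worker, NULL);
	pthread_create(&t2, NULL, unregister_worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	io_uring_queue_exit(&ring);
	return 0;
}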

A couple of thoughts on this:

1. I don't like waiting without holding the lock in general, because
someone can submit more reqs in between and so indefinitely postpone
the files_unregister.
Thanks, Pavel.
I had thought about this issue before, until I saw this in
__io_uring_register():

	if (io_register_op_must_quiesce(opcode)) {
		percpu_ref_kill(&ctx->refs);

		/*
		 * Drop uring mutex before waiting for references to exit. If
		 * another thread is currently inside io_uring_enter() it might
		 * need to grab the uring_lock to make progress. If we hold it
		 * here across the drain wait, then we can deadlock. It's safe
		 * to drop the mutex here, since no new references will come in
		 * after we've killed the percpu ref.
		 */
		mutex_unlock(&ctx->uring_lock);
		do {
			ret = wait_for_completion_interruptible(&ctx->ref_comp);
			if (!ret)
				break;
			ret = io_run_task_work_sig();
			if (ret < 0)
				break;
		} while (1);

		mutex_lock(&ctx->uring_lock);

		if (ret) {
			percpu_ref_resurrect(&ctx->refs);
			goto out_quiesce;
		}
	}

So now I guess the postponement issue also exists in the above code,
since there could be another thread submitting reqs to the shared ctx
(or, we could say, the uring fd).

2. I wouldn't want to add checks for that in the submission path.

So, a solution I'm thinking about is to wait under the lock; if we
need to run task_works, briefly drop the lock, run the task_works, and
then do the whole unregister all over again. Keep an eye on the refs,
e.g. we probably need to resurrect them.

Because the current task is busy, nobody submits new requests on its
behalf, and so there can't be an infinite number of in-task_work reqs;
eventually it will just go wait/sleep forever (if not signalled) under
the mutex, so we get a kind of upper bound on the time.
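A rough sketch of that retry-under-the-lock idea (hypothetical helper
name and simplified error handling, not the actual patch): only the
task_work run drops uring_lock, and the caller restarts the unregister
if we bail out on a signal.

static int io_sqe_files_wait_done(struct io_ring_ctx *ctx)
{
	struct fixed_file_data *data = ctx->file_data;
	int ret;

	lockdep_assert_held(&ctx->uring_lock);

	do {
		ret = wait_for_completion_interruptible(&data->done);
		if (!ret)
			return 0;	/* all refs nodes are done */
		/*
		 * task_work may need uring_lock to make progress, so drop
		 * it only around io_run_task_work_sig(), then retake it
		 * and retry the wait.
		 */
		mutex_unlock(&ctx->uring_lock);
		ret = io_run_task_work_sig();
		mutex_lock(&ctx->uring_lock);
	} while (ret >= 0);

	/* got a signal: caller resurrects data->refs and redoes the unregister */
	return ret;
}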

Do you mean sleeping with a timeout rather than just sleeping? I think
this works; I'll work on it and think about the details. But before
addressing this issue, should I first send a patch that just fixes the
deadlock?

Thanks,
Hao


This one should be simple, similar to:

if (percpu_refs_is_dying())
     return error; // fail *files_unregister();
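Spelled out, that check could sit near the top of
io_sqe_files_unregister(), roughly as in the sketch below;
percpu_ref_is_dying() is the existing helper from
<linux/percpu-refcount.h>, while the -EBUSY error code is only an
assumption here.

	struct fixed_file_data *data = ctx->file_data;

	if (!data)
		return -ENXIO;
	/* another unregister already killed the refs: fail this one */
	if (percpu_ref_is_dying(&data->refs))
		return -EBUSY;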


2. Probably the same for unregister and submit.


Reported-by: Abaci <abaci@xxxxxxxxxxxxxxxxx>
Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
---
   fs/io_uring.c | 19 +++++++++++++------
   1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index efb6d02fea6f..b093977713ee 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7362,18 +7362,25 @@ static int io_sqe_files_unregister(struct io_ring_ctx *ctx, bool locked)
 
 	/* wait for all refs nodes to complete */
 	flush_delayed_work(&ctx->file_put_work);
+	if (locked)
+		mutex_unlock(&ctx->uring_lock);
 	do {
 		ret = wait_for_completion_interruptible(&data->done);
 		if (!ret)
 			break;
 		ret = io_run_task_work_sig();
-		if (ret < 0) {
-			percpu_ref_resurrect(&data->refs);
-			reinit_completion(&data->done);
-			io_sqe_files_set_node(data, backup_node);
-			return ret;
-		}
+		if (ret < 0)
+			break;
 	} while (1);
+	if (locked)
+		mutex_lock(&ctx->uring_lock);
+
+	if (ret < 0) {
+		percpu_ref_resurrect(&data->refs);
+		reinit_completion(&data->done);
+		io_sqe_files_set_node(data, backup_node);
+		return ret;
+	}
 
 	__io_sqe_files_unregister(ctx);
 	nr_tables = DIV_ROUND_UP(ctx->nr_user_files, IORING_MAX_FILES_TABLE);









