Re: [PATCH v2 13/13] io_uring: support buffer registration sharing

Bijan Mottahedeh <bijan.mottahedeh@xxxxxxxxxx> · Wed, 6 Jan 2021 16:50:50 -0800




On 12/18/2020 10:06 AM, Bijan Mottahedeh wrote:

@@ -8415,6 +8421,12 @@ static int io_sqe_buffers_unregister(struct 
io_ring_ctx *ctx)
      if (!data)
          return -ENXIO;
+    if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
+        io_detach_buf_data(ctx);
+        ctx->nr_user_bufs = 0;

nr_user_bufs is a part of invariant and should stay together with
stuff in io_detach_buf_data().

Moved to io_detach_buf_data.


@@ -8724,9 +8740,17 @@ static int io_sqe_buffers_register(struct 
io_ring_ctx *ctx, void __user *arg,
      struct fixed_rsrc_ref_node *ref_node;
      struct fixed_rsrc_data *buf_data;
+    if (ctx->flags & IORING_SETUP_ATTACH_BUF) {
+        if (!ctx->buf_data)
+            return -EFAULT;
+        ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs;

Why? Once a table is initialised it shouldn't change its size, would
be racy otherwise.

ctx->buf_data is set at ring setup time but the sharing process 
(SETUP_SHARE) may do the actual buffer registration at an arbitrary time 
later, so the attaching process must ensure to get the updated value of 
nr_user_bufs if available.

      buf_data = io_buffers_map_alloc(ctx, nr_args);
      if (IS_ERR(buf_data))
          return PTR_ERR(buf_data);
+    ctx->buf_data = buf_data;

Wanted to write that there is missing
`if (ctx->user_bufs) return -EBUSY`

but apparently it was moved into io_buffers_map_alloc().
I'd really prefer to have it here.

Moved it back.

+static int io_attach_buf_data(struct io_ring_ctx *ctx,
+                  struct io_uring_params *p)
+{
+    struct io_ring_ctx *ctx_attach;
+    struct fd f;
+
+    f = fdget(p->wq_fd);
+    if (!f.file)
+        return -EBADF;
+    if (f.file->f_op != &io_uring_fops) {
+        fdput(f);
+        return -EINVAL;
+    }
+
+    ctx_attach = f.file->private_data;
+    if (!ctx_attach->buf_data) {

It looks racy. What prevents it from being deleted while we're
working on it, e.g. by io_sqe_buffers_unregister?

I think the premise here is that buffer sharing happens between trusted 
and coordinated processes.  If I understand your concern correctly, then 
if the sharing process unregisters its buffers after having shared them, 
than that process is acting improperly.  The race could lead to a failed 
attach but that would be expected and reasonable I would think?  What do 
you think should happen in this case?


+        fdput(f);
+        return -EINVAL;
+    }
+    ctx->buf_data = ctx_attach->buf_data;

Before updates, etc. (e.g. __io_sqe_buffers_update()) were synchronised
by uring_lock, now it's modified concurrently, that looks to be really
racy.

Racy from the attaching process perspective you mean?


+
+    percpu_ref_get(&ctx->buf_data->refs);

Ok, now the original io_uring instance will wait until the attached
once get rid of their references. That's a versatile ground to have
in kernel deadlocks.

task1: uring1 = create()
task2: uring2 = create()
task1: uring3 = create(share=uring2);
task2: uring4 = create(share=uring1);

task1: io_sqe_buffers_unregister(uring1)
task2: io_sqe_buffers_unregister(uring2)

If I skimmed through the code right, that should hang unkillably.

So we need a way to enforce that a process can only have one role, 
sharing or attaching? But I'm not what the best way to do that.  Is this 
an issue for other resource sharing, work queues or polling thread?


The intended use case for buffer registration is:

- a group of processes attach a shmem segment
- one process registers the buffers in the shmem segment and shares it
- other processes attach that registration

For this case, it seems that there is really no need to wait for the 
attached processes to get rid of the their references since the shmem 
segment (and thus the registered buffers) will persist anyway until the 
last attached process goes away.  So the last unregister could quiesce 
all references and get rid of the shared buf_data.

I'm not sure how useful the non-shmem use case would be anyway.

Would it makes sense to restrict the scope of this feature?