On 10/24/24 2:08 PM, Jann Horn wrote:
> On Thu, Oct 24, 2024 at 9:59 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>> On 10/24/24 1:53 PM, Jann Horn wrote:
>>> On Thu, Oct 24, 2024 at 9:50 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>> On 10/24/24 12:13 PM, Jann Horn wrote:
>>>>> On Thu, Oct 24, 2024 at 7:08 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>> Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize
>>>>>> the existing rings. It takes a struct io_uring_params argument, the same
>>>>>> one which is used to setup the ring initially, and resizes rings
>>>>>> according to the sizes given.
>>>>> [...]
>>>>>> +	 * We'll do the swap. Clear out existing mappings to prevent mmap
>>>>>> +	 * from seeing them, as we'll unmap them. Any attempt to mmap existing
>>>>>> +	 * rings beyond this point will fail. Not that it could proceed at this
>>>>>> +	 * point anyway, as we'll hold the mmap_sem until we've done the swap.
>>>>>> +	 * Likewise, hold the completion * lock over the duration of the actual
>>>>>> +	 * swap.
>>>>>> +	 */
>>>>>> +	mmap_write_lock(current->mm);
>>>>>
>>>>> Why does the mmap lock for current->mm suffice here? I see nothing in
>>>>> io_uring_mmap() that limits mmap() to tasks with the same mm_struct.
>>>>
>>>> Ehm, does ->mmap() not hold ->mmap_sem already? I was under that
>>>> understanding. Obviously if it doesn't, then yeah this won't be enough.
>>>> Checked, and it does.
>>>>
>>>> Ah, I see what you mean now, a task with a different mm. But how would
>>>> that come about? The io_uring fd is CLOEXEC, and it can't get passed.
>>>
>>> Yeah, that's what I meant, tasks with different mm. I think there are
>>> a few ways to get the io_uring fd into a different task, the ones I
>>> can immediately think of:
>>>
>>> - O_CLOEXEC only applies on execve(), fork() should still inherit the fd
>>> - O_CLOEXEC can be cleared via fcntl()
>>> - you can use clone() to create two tasks that share FD tables
>>>   without sharing an mm
>>
>> OK, good catch, yes then it won't be enough. Might just make sense to
>> exclude mmap separately, then. Thanks, I'll work on that for v4!
>
> Yeah, that sounds reasonable to me.

Something like this should do it; it's really just replacing mmap_sem
with a ring-private lock. And since the ordering already had to deal
with uring_lock vs mmap_sem ABBA issues, this should slot straight in
as well. Totally untested, and I'll write a fork() test case for the
resize-rings test cases as well.

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9746f4bb5182..2f12828b22a4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -423,6 +423,13 @@ struct io_ring_ctx {
 	/* protected by ->completion_lock */
 	unsigned		evfd_last_cq_tail;
 
+	/*
+	 * Protection for resize vs mmap races - both the mmap and resize
+	 * side will need to grab this lock, to prevent either side from
+	 * being run concurrently with the other.
+	 */
+	struct mutex		resize_lock;
+
 	/*
 	 * If IORING_SETUP_NO_MMAP is used, then the below holds
 	 * the gup'ed pages for the two rings, and the sqes.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d0b6ec8039cb..2863b957e373 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -353,6 +353,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
 	INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
 	io_napi_init(ctx);
+	mutex_init(&ctx->resize_lock);
 
 	return ctx;
 err:
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index d614824e17bd..fc777de2c229 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -251,6 +251,8 @@ __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned int npages;
 	void *ptr;
 
+	guard(mutex)(&ctx->resize_lock);
+
 	ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);
@@ -274,6 +276,7 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
 					 unsigned long len, unsigned long pgoff,
 					 unsigned long flags)
 {
+	struct io_ring_ctx *ctx = filp->private_data;
 	void *ptr;
 
 	/*
@@ -284,6 +287,8 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
 	if (addr)
 		return -EINVAL;
 
+	guard(mutex)(&ctx->resize_lock);
+
 	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
 	if (IS_ERR(ptr))
 		return -ENOMEM;
@@ -329,8 +334,11 @@ unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
 					 unsigned long len, unsigned long pgoff,
 					 unsigned long flags)
 {
+	struct io_ring_ctx *ctx = file->private_data;
 	void *ptr;
 
+	guard(mutex)(&ctx->resize_lock);
+
 	ptr = io_uring_validate_mmap_request(file, pgoff, len);
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);
diff --git a/io_uring/register.c b/io_uring/register.c
index 0f6b18b3d3d1..1eb686eaa310 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -486,14 +486,15 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 	}
 
 	/*
-	 * We'll do the swap. Clear out existing mappings to prevent mmap
-	 * from seeing them, as we'll unmap them. Any attempt to mmap existing
-	 * rings beyond this point will fail. Not that it could proceed at this
-	 * point anyway, as we'll hold the mmap_sem until we've done the swap.
-	 * Likewise, hold the completion * lock over the duration of the actual
-	 * swap.
+	 * We'll do the swap. Grab the ctx->resize_lock, which will exclude
+	 * any new mmaps on the ring fd. Clear out existing mappings to prevent
+	 * mmap from seeing them, as we'll unmap them. Any attempt to mmap
+	 * existing rings beyond this point will fail. Not that it could proceed
+	 * at this point anyway, as the io_uring mmap side needs to grab the
+	 * ctx->resize_lock as well. Likewise, hold the completion lock over the
+	 * duration of the actual swap.
 	 */
-	mmap_write_lock(current->mm);
+	mutex_lock(&ctx->resize_lock);
 	spin_lock(&ctx->completion_lock);
 	o.rings = ctx->rings;
 	ctx->rings = NULL;
@@ -560,7 +561,7 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 	ret = 0;
 out:
 	spin_unlock(&ctx->completion_lock);
-	mmap_write_unlock(current->mm);
+	mutex_unlock(&ctx->resize_lock);
 	io_register_free_rings(&p, to_free);
 
 	if (ctx->sq_data)

-- 
Jens Axboe
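[Editor's note: for illustration, below is a minimal userspace sketch of
the fork() race discussed above. A child with its own mm (fork() ignores
O_CLOEXEC, so the ring fd is inherited) repeatedly mmap()s the SQ ring
while the parent issues IORING_REGISTER_RESIZE_RINGS; without a
per-ring lock, the resizer's lock on its own current->mm cannot exclude
the child's mmap(). This is a hypothetical repro sketch, not the actual
liburing test case Jens mentions; it assumes uapi headers carrying the
new IORING_REGISTER_RESIZE_RINGS opcode from this series, and that the
resize registration takes a struct io_uring_params pointer with
nr_args == 1, per the patch being discussed.]

/*
 * Hypothetical sketch of the fork() scenario: the child shares the
 * inherited ring fd but not the mm. Build against headers that define
 * IORING_REGISTER_RESIZE_RINGS; error handling is mostly omitted.
 */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int ring_setup(unsigned int entries, struct io_uring_params *p)
{
	return syscall(__NR_io_uring_setup, entries, p);
}

static int ring_register(int fd, unsigned int op, void *arg, unsigned int nr)
{
	return syscall(__NR_io_uring_register, fd, op, arg, nr);
}

int main(void)
{
	struct io_uring_params p;
	int fd, i;

	memset(&p, 0, sizeof(p));
	fd = ring_setup(16, &p);
	if (fd < 0)
		return 1;

	if (fork() == 0) {
		/*
		 * Child: different mm, same ring fd. Hammer mmap of the
		 * SQ ring to race against the parent's resize.
		 */
		for (i = 0; i < 10000; i++) {
			void *ptr;

			ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				   MAP_SHARED, fd, IORING_OFF_SQ_RING);
			if (ptr != MAP_FAILED)
				munmap(ptr, 4096);
		}
		_exit(0);
	}

	/* Parent: resize the rings back and forth, concurrently. */
	for (i = 0; i < 10000; i++) {
		struct io_uring_params rp;

		memset(&rp, 0, sizeof(rp));
		rp.sq_entries = (i & 1) ? 32 : 16;
		rp.cq_entries = 2 * rp.sq_entries;
		ring_register(fd, IORING_REGISTER_RESIZE_RINGS, &rp, 1);
	}
	wait(NULL);
	return 0;
}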