On 10/24/24 2:08 PM, Jann Horn wrote:
> On Thu, Oct 24, 2024 at 9:59 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>> On 10/24/24 1:53 PM, Jann Horn wrote:
>>> On Thu, Oct 24, 2024 at 9:50 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>> On 10/24/24 12:13 PM, Jann Horn wrote:
>>>>> On Thu, Oct 24, 2024 at 7:08 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>> Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize
>>>>>> the existing rings. It takes a struct io_uring_params argument, the same
>>>>>> one which is used to setup the ring initially, and resizes rings
>>>>>> according to the sizes given.
>>>>> [...]
>>>>>> +	 * We'll do the swap. Clear out existing mappings to prevent mmap
>>>>>> +	 * from seeing them, as we'll unmap them. Any attempt to mmap existing
>>>>>> +	 * rings beyond this point will fail. Not that it could proceed at this
>>>>>> +	 * point anyway, as we'll hold the mmap_sem until we've done the swap.
>>>>>> +	 * Likewise, hold the completion * lock over the duration of the actual
>>>>>> +	 * swap.
>>>>>> +	 */
>>>>>> +	mmap_write_lock(current->mm);
>>>>>
>>>>> Why does the mmap lock for current->mm suffice here? I see nothing in
>>>>> io_uring_mmap() that limits mmap() to tasks with the same mm_struct.
>>>>
>>>> Ehm, does ->mmap() not hold ->mmap_sem already? I was under that
>>>> understanding. Obviously if it doesn't, then yeah this won't be enough.
>>>> Checked, and it does.
>>>>
>>>> Ah, I see what you mean now, a task with a different mm. But how would
>>>> that come about? The io_uring fd is CLOEXEC, and it can't get passed.
>>>
>>> Yeah, that's what I meant, tasks with different mm. I think there are
>>> a few ways to get the io_uring fd into a different task, the ones I
>>> can immediately think of:
>>>
>>> - O_CLOEXEC only applies on execve(), fork() should still inherit the fd
>>> - O_CLOEXEC can be cleared via fcntl()
>>> - you can use clone() to create two tasks that share FD tables
>>>   without sharing an mm
>>
>> OK, good catch, yes then it won't be enough. Might just make sense to
>> exclude mmap separately, then. Thanks, I'll work on that for v4!
>
> Yeah, that sounds reasonable to me.

Something like this should do it; it's really just replacing mmap_sem
with a ring-private lock. And since the ordering already had to deal
with uring_lock vs mmap_sem ABBA issues, this should slot straight in
as well. Totally untested, and I'll write a fork() test case for the
resize-rings test cases as well.

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9746f4bb5182..2f12828b22a4 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -423,6 +423,13 @@ struct io_ring_ctx {
 	/* protected by ->completion_lock */
 	unsigned		evfd_last_cq_tail;
 
+	/*
+	 * Protection for resize vs mmap races - both the mmap and resize
+	 * side will need to grab this lock, to prevent either side from
+	 * being run concurrently with the other.
+	 */
+	struct mutex		resize_lock;
+
 	/*
 	 * If IORING_SETUP_NO_MMAP is used, then the below holds
 	 * the gup'ed pages for the two rings, and the sqes.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d0b6ec8039cb..2863b957e373 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -353,6 +353,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
 	INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
 	io_napi_init(ctx);
+	mutex_init(&ctx->resize_lock);
 
 	return ctx;
 err:
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index d614824e17bd..fc777de2c229 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -251,6 +251,8 @@ __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned int npages;
 	void *ptr;
 
+	guard(mutex)(&ctx->resize_lock);
+
 	ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);
@@ -274,6 +276,7 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
 					 unsigned long len, unsigned long pgoff,
 					 unsigned long flags)
 {
+	struct io_ring_ctx *ctx = filp->private_data;
 	void *ptr;
 
 	/*
@@ -284,6 +287,8 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
 	if (addr)
 		return -EINVAL;
 
+	guard(mutex)(&ctx->resize_lock);
+
 	ptr = io_uring_validate_mmap_request(filp, pgoff, len);
 	if (IS_ERR(ptr))
 		return -ENOMEM;
@@ -329,8 +334,11 @@ unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
 					 unsigned long len, unsigned long pgoff,
 					 unsigned long flags)
 {
+	struct io_ring_ctx *ctx = file->private_data;
 	void *ptr;
 
+	guard(mutex)(&ctx->resize_lock);
+
 	ptr = io_uring_validate_mmap_request(file, pgoff, len);
 	if (IS_ERR(ptr))
 		return PTR_ERR(ptr);
diff --git a/io_uring/register.c b/io_uring/register.c
index 0f6b18b3d3d1..1eb686eaa310 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -486,14 +486,15 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 	}
 
 	/*
-	 * We'll do the swap. Clear out existing mappings to prevent mmap
-	 * from seeing them, as we'll unmap them. Any attempt to mmap existing
-	 * rings beyond this point will fail. Not that it could proceed at this
-	 * point anyway, as we'll hold the mmap_sem until we've done the swap.
-	 * Likewise, hold the completion * lock over the duration of the actual
-	 * swap.
+	 * We'll do the swap. Grab the ctx->resize_lock, which will exclude
+	 * any new mmaps on the ring fd. Clear out existing mappings to prevent
+	 * mmap from seeing them, as we'll unmap them. Any attempt to mmap
+	 * existing rings beyond this point will fail. Not that it could proceed
+	 * at this point anyway, as the io_uring mmap side needs to grab the
+	 * ctx->resize_lock as well. Likewise, hold the completion lock over the
+	 * duration of the actual swap.
 	 */
-	mmap_write_lock(current->mm);
+	mutex_lock(&ctx->resize_lock);
 	spin_lock(&ctx->completion_lock);
 	o.rings = ctx->rings;
 	ctx->rings = NULL;
@@ -560,7 +561,7 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 	ret = 0;
 out:
 	spin_unlock(&ctx->completion_lock);
-	mmap_write_unlock(current->mm);
+	mutex_unlock(&ctx->resize_lock);
 	io_register_free_rings(&p, to_free);
 
 	if (ctx->sq_data)

-- 
Jens Axboe
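[Editor's note: for illustration, below is a minimal userspace sketch of
the fork() race discussed above. A child with its own mm (fork() ignores
O_CLOEXEC, so the ring fd is inherited) repeatedly mmap()s the SQ ring
while the parent issues IORING_REGISTER_RESIZE_RINGS; without a
per-ring lock, the resizer's lock on its own current->mm cannot exclude
the child's mmap(). This is a hypothetical repro sketch, not the actual
liburing test case Jens mentions; it assumes uapi headers carrying the
new IORING_REGISTER_RESIZE_RINGS opcode from this series, and that the
resize registration takes a struct io_uring_params pointer with
nr_args == 1, per the patch being discussed.]

/*
 * Hypothetical sketch of the fork() scenario: the child shares the
 * inherited ring fd but not the mm. Build against headers that define
 * IORING_REGISTER_RESIZE_RINGS; error handling is mostly omitted.
 */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static int ring_setup(unsigned int entries, struct io_uring_params *p)
{
	return syscall(__NR_io_uring_setup, entries, p);
}

static int ring_register(int fd, unsigned int op, void *arg, unsigned int nr)
{
	return syscall(__NR_io_uring_register, fd, op, arg, nr);
}

int main(void)
{
	struct io_uring_params p;
	int fd, i;

	memset(&p, 0, sizeof(p));
	fd = ring_setup(16, &p);
	if (fd < 0)
		return 1;

	if (fork() == 0) {
		/*
		 * Child: different mm, same ring fd. Hammer mmap of the
		 * SQ ring to race against the parent's resize.
		 */
		for (i = 0; i < 10000; i++) {
			void *ptr;

			ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				   MAP_SHARED, fd, IORING_OFF_SQ_RING);
			if (ptr != MAP_FAILED)
				munmap(ptr, 4096);
		}
		_exit(0);
	}

	/* Parent: resize the rings back and forth, concurrently. */
	for (i = 0; i < 10000; i++) {
		struct io_uring_params rp;

		memset(&rp, 0, sizeof(rp));
		rp.sq_entries = (i & 1) ? 32 : 16;
		rp.cq_entries = 2 * rp.sq_entries;
		ring_register(fd, IORING_REGISTER_RESIZE_RINGS, &rp, 1);
	}
	wait(NULL);
	return 0;
}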