On Wed, Jan 16, 2019 at 02:20:53PM -0700, Jens Axboe wrote: > On 1/16/19 1:53 PM, Dave Chinner wrote: > > On Wed, Jan 16, 2019 at 10:50:00AM -0700, Jens Axboe wrote: > >> If we have fixed user buffers, we can map them into the kernel when we > >> setup the io_context. That avoids the need to do get_user_pages() for > >> each and every IO. > > ..... > >> + return -ENOMEM; > >> + } while (atomic_long_cmpxchg(&ctx->user->locked_vm, cur_pages, > >> + new_pages) != cur_pages); > >> + > >> + return 0; > >> +} > >> + > >> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) > >> +{ > >> + int i, j; > >> + > >> + if (!ctx->user_bufs) > >> + return -EINVAL; > >> + > >> + for (i = 0; i < ctx->sq_entries; i++) { > >> + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; > >> + > >> + for (j = 0; j < imu->nr_bvecs; j++) { > >> + set_page_dirty_lock(imu->bvec[j].bv_page); > >> + put_page(imu->bvec[j].bv_page); > >> + } > > > > Hmmm, so we call set_page_dirty() when the gup reference is dropped... > > > > ..... > > > >> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, > >> + unsigned nr_args) > >> +{ > > > > ..... > > > >> + down_write(¤t->mm->mmap_sem); > >> + pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE, > >> + pages, NULL); > >> + up_write(¤t->mm->mmap_sem); > > > > Thought so. This has the same problem as RDMA w.r.t. using > > file-backed mappings for the user buffer. It is not synchronised > > against truncate, hole punches, async page writeback cleaning the > > page, etc, and so can lead to data corruption and/or kernel panics. > > > > It also can't be used with DAX because the above problems are > > actually a user-after-free of storage space, not just a dangling > > page reference that can be cleaned up after the gup pin is dropped. > > > > Perhaps, at least until we solve the GUP problems w.r.t. file backed > > pages and/or add and require file layout leases for these reference, > > we should error out if the user buffer pages are file-backed > > mappings? > > Thanks for taking a look at this. > > I'd be fine with that restriction, especially since it can get relaxed > down the line. Do we have an appropriate API for this? And why isn't > get_user_pages_longterm() that exact API already? get_user_pages_longterm() is the right thing to use to ensure DAX doesn't trip over this - it's effectively just get_user_pages() with a "if (vma_is_fsdax(vma))" check in it to abort and return -EOPNOTSUPP. IOWs, this is safe on DAX but it's not safe on anything else. :/ Unfortunately, disallowing userspace GUP pins on non-DAX file backed pages will break existing "mostly just work" userspace apps all over the place. And so right now there are discussions ongoing about how to map gup references avoid the writeback races and be able to be seen/tracked by other kernel infrastructure (see the long, long thread "[PATCH 0/2] put_user_page*(): start converting the call sites" on -fsdevel). Progress is slow, but I think we're starting to close on a workable solution. FWIW, this doesn't solve the "long term user pin will block filesystem operations until unpin" problem, that's what moving to using revocable file layout leases is intended to solve. There have been patches posted some time ago to add this user API for this, but we've got to solve the other problems first.... > Would seem that most > (all?) callers of this API is currently broken then. Yup, there's a long, long history of machines using userspace RDMA panicing because filesystems have detected or tripped over invalid page cache state during writeback attempts. This is not a new problem.... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx