On Wed, Jan 16, 2019 at 03:21:21PM -0700, Jens Axboe wrote: > On 1/16/19 3:09 PM, Dave Chinner wrote: > > On Wed, Jan 16, 2019 at 02:20:53PM -0700, Jens Axboe wrote: > >> On 1/16/19 1:53 PM, Dave Chinner wrote: > >> I'd be fine with that restriction, especially since it can get relaxed > >> down the line. Do we have an appropriate API for this? And why isn't > >> get_user_pages_longterm() that exact API already? > > > > get_user_pages_longterm() is the right thing to use to ensure DAX > > doesn't trip over this - it's effectively just get_user_pages() > > with a "if (vma_is_fsdax(vma))" check in it to abort and return > > -EOPNOTSUPP. IOWs, this is safe on DAX but it's not safe on anything > > else. :/ > > > > Unfortunately, disallowing userspace GUP pins on non-DAX file backed > > pages will break existing "mostly just work" userspace apps all over > > the place. And so right now there are discussions ongoing about how > > to map gup references avoid the writeback races and be able to be > > seen/tracked by other kernel infrastructure (see the long, long > > thread "[PATCH 0/2] put_user_page*(): start converting the call > > sites" on -fsdevel). Progress is slow, but I think we're starting to > > close on a workable solution. > > > > FWIW, this doesn't solve the "long term user pin will block > > filesystem operations until unpin" problem, that's what moving to > > using revocable file layout leases is intended to solve. There have > > been patches posted some time ago to add this user API for this, but > > we've got to solve the other problems first.... > > > >> Would seem that most > >> (all?) callers of this API is currently broken then. > > > > Yup, there's a long, long history of machines using userspace RDMA > > panicing because filesystems have detected or tripped over invalid > > page cache state during writeback attempts. This is not a new > > problem.... > > Thanks for your detailed answer, Dave! I didn't see it before I sent > out the previous email. FWIW, I've updated the patch: > > http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=0c8f2299f8069af6b2fa8f99a10d81646d1237a7 > > Checks for file backed memory, fails the registration with EOPNOTSUPP > if the check fails. Doesn't it need to call put_pages() on all the pages picked up by get_user_pages_longterm() when it returns -EOPNOTSUPP? They haven't been mapped into the imu->bvec array yet, so AFAICT there's nothing to release the page references on teardown here. Also, not a vma expert here, but the vma array contents may only be valid while the mmap_sem is held - I think vmas can come and go after it has been dropped and so accessing vmas to check vma->vm_file after the mmap_sem has been dropped may be open to read-after-free races. > That should handle the issue on the io_uring side at least, and it's a > restriction that can always be relaxed/lifted, when appropriate solutions > to file backed buffers exists. Modulo the issue above, that works for me. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx