On Mon, Apr 24, 2023 at 03:54:48PM -0300, Jason Gunthorpe wrote: > On Mon, Apr 24, 2023 at 07:22:03PM +0100, Lorenzo Stoakes wrote: > > > OK I guess you mean the folio lock :) Well there is > > unpin_user_pages_dirty_lock() and unpin_user_page_range_dirty_lock() and > > also set_page_dirty_lock() (used by __access_remote_vm()) which should > > avoid this. > > It has been a while, but IIRC, these are all basically racy, the > comment in front of set_page_dirty_lock() even says it is racy.. > > The race is that a FS cleans a page and thinks it cannot become dirty, > and then it becomes dirty - and all variations of that.. > > Looking around a bit, I suppose what I'd expect to see is a sequence > sort of like what do_page_mkwrite() does: > > /* Synchronize with the FS and get the page locked */ > ret = vmf->vma->vm_ops->page_mkwrite(vmf); > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) > return ret; > if (unlikely(!(ret & VM_FAULT_LOCKED))) { > lock_page(page); > if (!page->mapping) { > unlock_page(page); > return 0; /* retry */ > } > ret |= VM_FAULT_LOCKED; > } else > VM_BUG_ON_PAGE(!PageLocked(page), page); > > /* Write to the page with the CPU */ > va = kmap_local_atomic(page); > memcpy(va, ....); > kunmap_local_atomic(page); > > /* Tell the FS and unlock it. */ > set_page_dirty(page); > unlock_page(page); > > I don't know if this is is exactly right, but it seems closerish > > So maybe some kind of GUP interfaces that returns single locked pages > is the right direction? IDK > > Or maybe we just need to make a memcpy primitive that works while > holding the PTLs? > I think this patch suggestion has scope crept from 'incremental improvement' to 'major rework of GUP' at this point. Also surely you'd want to obtain the PTL of all mappings to a file? This seems really unworkable and I don't think holding a folio lock over a long period is sensible either. > > We definitely need to keep ptrace and /proc/$pid/mem functioning correctly, > > and I given the privilege levels required I don't think there's a security > > issue there? > > Even root is not allowed to trigger data corruption or oops inside the > kernel. > > Jason Of course, but isn't this supposed to be an incremental fix? It feels a little contradictory to want to introduce a flag intentionally to try to highlight brokenness then to not accept any solution that doesn't solve that brokenness. In any case, I feel that this patch isn't going to go anywhere as-is, it's insufficiently large to solve the problem as a whole (I think that's a bigger problem we can return to later), and there appears to be no taste for an incremental improvement, even from the suggester :) As a result, I suggest we take the cautious route in order to unstick the vmas patch series - introduce an OPT-IN flag which allows the check to be made, and update io_uring to use this. That way it defers the larger discussion around this improvement, avoids breaking anything, provides some basis in code for this check and is a net, incremental small and digestible improvement.