Hi Hugh, On Tue, Dec 01, 2020 at 01:31:21PM -0800, Hugh Dickins wrote: > Please don't ever rely on that i_private business for correctness: as The out of order and lockless "if (inode->i_private)" branch didn't inspire much confidence in terms of being able to rely on it for locking correctness indeed. > more of the comment around "So refrain from" explains, it was added > to avoid a livelock with the trinity fuzzer, and I'd gladly delete > it all, if I could ever be confident that enough has changed in the > intervening years that it's no longer necessary. I was wondering if it's still needed, so now I checked the old code and it also did an unconditional lock_page in shmem_fault, so I assume it's still necessary. > It tends to prevent shmem faults racing hole punches in the same area > (without quite guaranteeing no race at all). But without the livelock > issue, I'd just have gone on letting them race. "Punch hole" == > "Lose data" - though I can imagine that UFFD would like more control > over that. Since map_pages is driven by faulting, it should already > be throttled too. Yes I see now it just needs to "eventually" stop the shmem_fault activity, it doesn't need to catch those faults already in flight, so it cannot relied upon as the form of serialization to zap the pageteables while truncating the page. > But Andrea in next mail goes on to see other issues with UFFD_WP > in relation to shmem swap, so this thread is probably dead now. > If there's a bit to spare for UFFD_WP in the anonymous swap entry, > then that bit should be usable for shmem too: but a shmem page > (and its swap entry) can be shared between many different users, > so I don't know whether that will make sense. An UFFD_WP software bit exists both for the present pte and for the swap entry: #define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP) #define _PAGE_SWP_UFFD_WP _PAGE_USER It works similarly to soft dirty, if the page is swapped _PAGE_UFFD_WP it's translated to _PAGE_SWP_UFFD_WP and the other way around during swapin. The problem is that the bit represents an information that is specific to an mm and a single mapping. If you punch an hole in a shmem, you map the shmem file in two processes A and B, you register the mapped range in a uffd opened by process A using VM_UFFD_MISSING, only process A will generate userfaults if you access the virtual address that corresponds to the hole in the file, process B will still fill the hole normally and it won't trigger userfaults in process A. The uffd context is attached to an mm, and the notification only has effect on that mm, the mm that created the context. Only the UFFDIO_ operations that resolve the fault (like UFFDIO_COPY) have effect visible by all file sharers. So VM_UFFD_WP shall work the same, and the _PAGE_SWP_UFFD_WP information of a swapped out shmem page shouldn't go in the xarray because doing so would share it with all "mm" sharing the mapping. Currently uffd-wp only works on anon memory so this is a new challenge in extending uffd-wp to shmem it seems. If any pagetable of an anonymous memory mapping is zapped, then not just the _PAGE_SWP_UFFD_WP bit is lost, the whole data is lost too so it'd break not just uffd-wp. As opposed with shmem the ptes can be zapped at all times by memory pressure alone even before the final ->writepage swap out is invoked. It's always possible to make uffd-wp work without those bits (anon memory also would work without those bits), but the reason those bits exist is to avoid a flood of UFFDIO_WRITEPROTECT ioctl false-negatives after fork or after swapping (in the anon case). In the shmem case if we'd it without a bitflag to track which page was wrprotected in which vma, the uffd handler would be required to mark each pte writable with a syscall every time a new pte has been established on the range and there's a write to the page. One possibility could be to store the bit set by UFFDIO_WRITEPROTECT in a vmalloc bitmap associated with the shmem vma at uffd registration time, so it can be checked during the shmem_fault() if VM_UFFD_WP is set on the vma? The vma_splits and vma_merge would be the hard part. Another possibility would be to change how the pte invalidate work for vmas with VM_UFFD_WP set, the pte and the _PAGE_UFFD_WP would need to survive all invalidates... only the final truncation under page lock would be really allowed to clear the whole pte and destroy the _PAGE_UFFD_WP information in the pte and then to free the pgtables on those ranges. Once we find a way to make that bit survive the invalidates, we can check it in shmem_fault to decide if to invoke handle_userfault if the bit is set and FAULT_FLAG_WRITE is set in vmf->flags, or if to map the page wrprotected in that vaddr if the bit is set and FAULT_FLAG_WRITE is not set. Thanks, Andrea