Thanks for the detailed answer, Mike. Things are clearer in regard to your intention. > On Dec 6, 2020, at 1:37 AM, Mike Rapoport <rppt@xxxxxxxxxxxxx> wrote: > > The uffd monotor should *know* what is the state of child's memory and > without this patch it could only guess. I see - so mmap_changing is not just about graceful handling of copy-ioctl’s (which I think monitors could handle before mmap_changing was introduced) but to allow the monitor to know which pages are mapped in each process. Makes sense, but I have strong doubts it really works (see below). >> 2. How is memory ordering supposed to work here? IIUC, mmap_changing is not >> protected by any lock and there are no memory barriers that are associated >> with the assignment. Indeed, the code calls WRITE_ONCE()/READ_ONCE(), but >> AFAIK this does not guarantee ordering with non-volatile reads/writes. > > There is also mmap_lock involved, so I don't see how copy can start in > parallel with fork processing. Fork sets mmap_chaning to true while > holding mmap_lock, so copy cannot start in parallel. When mmap_lock is > realeased, mmap_chaning remains true until fork event is pushed to > userspace and when this is done there is no issue with > userfaultfd_copy. Whenever I run into a non-standard and non-trivial synchronization algorithm in the kernel (and elsewhere), I become very confused and concerned. I raised my question since I wanted to modify the code and could not figure out how to properly do so. Based on your input that the monitor is expected to know the child mappings according to userfaultfd events, I now think that the kernel does not provide this ability and the locking scheme is broken. Here are some scenarios that I think are broken - please correct me if I am wrong: * Scenario 1: MADV_DONTNEED racing with userfaultfd page-faults userfaultfd_remove() only holds the mmap_lock for read, so these events cannot be ordered with userfaultfd page-faults. * Scenario 2: MADV_DONTNEED racing with fork() As userfaultfd_remove() releases mmap_lock after the user notification and before the actual unmapping, concurrent fork() might happen before or after the actual unmapping in MADV_DONTNEED and the user therefore has no way of knowing whether the actual unmapping took place before or after the fork(). * Scenario 3: Concurrent MADV_DONTNEED can cause userfaultfd_remove() to clear mmap_changing cleared before all the notifications are completed. As mmap_lock is only taken for read, the first thread the completed userfaultfd_remove() would clear the indication that was set by the other one. * Scenario 4: Fork starts and ends between copying of two pages. As mmap_lock might be released during ioctl_copy() (inside __mcopy_atomic()), some pages might be mapped in the child and others not: CPU0 CPU1 ---- ---- ioctl_copy(): __mcopy_atomic() mmap_read_lock() !mmap_changing [ok] mfill_atomic_pte() == 0 [page0 copied] mfill_atomic_pte() == -ENOENT [page1 will be retried] mmap_read_unlock() goto retry fork(): dup_userfaultfd() -> mmap_changing=true userfaultfd_event_wait_completion() -> mmap_changing=false mmap_read_lock() !mmap_changing [ok] mfill_atomic_pte() == 0 [page1 copied] mmap_read_unlock() return: 2 pages were mapped, while the first is present in the child and the second one is non-present. Bottom-line: it seems to me that mmap_changing should be a counter (not boolean) that is protected by mmap_lock. This counter should be kept elevated throughout the entire operation (in regard to MADV_DONTNEED). Perhaps mmap_lock does not have to be taken to decrease the counter, but then an smp_wmb() would be needed before the counter is decreased. Let me know whether I am completely off or missing something. Thanks, Nadav