> On Dec 8, 2020, at 12:34 AM, Mike Rapoport <rppt@xxxxxxxxxxxxx> wrote: > > On Sun, Dec 06, 2020 at 08:31:39PM -0800, Nadav Amit wrote: >> Whenever I run into a non-standard and non-trivial synchronization algorithm >> in the kernel (and elsewhere), I become very confused and concerned. I >> raised my question since I wanted to modify the code and could not figure >> out how to properly do so. Based on your input that the monitor is expected >> to know the child mappings according to userfaultfd events, I now think that >> the kernel does not provide this ability and the locking scheme is broken. >> >> Here are some scenarios that I think are broken - please correct me if I am >> wrong: >> >> * Scenario 1: MADV_DONTNEED racing with userfaultfd page-faults >> >> userfaultfd_remove() only holds the mmap_lock for read, so these events >> cannot be ordered with userfaultfd page-faults. >> >> * Scenario 2: MADV_DONTNEED racing with fork() >> >> As userfaultfd_remove() releases mmap_lock after the user notification and >> before the actual unmapping, concurrent fork() might happen before or after >> the actual unmapping in MADV_DONTNEED and the user therefore has no way of >> knowing whether the actual unmapping took place before or after the fork(). >> >> * Scenario 3: Concurrent MADV_DONTNEED can cause userfaultfd_remove() to >> clear mmap_changing cleared before all the notifications are completed. >> >> As mmap_lock is only taken for read, the first thread the completed >> userfaultfd_remove() would clear the indication that was set by the other >> one. >> >> * Scenario 4: Fork starts and ends between copying of two pages. >> >> As mmap_lock might be released during ioctl_copy() (inside >> __mcopy_atomic()), some pages might be mapped in the child and others not: >> >> >> CPU0 CPU1 >> ---- ---- >> ioctl_copy(): >> __mcopy_atomic() >> mmap_read_lock() >> !mmap_changing [ok] >> mfill_atomic_pte() == 0 [page0 copied] >> mfill_atomic_pte() == -ENOENT [page1 will be retried] >> mmap_read_unlock() >> goto retry >> >> fork(): >> dup_userfaultfd() >> -> mmap_changing=true >> userfaultfd_event_wait_completion() >> -> mmap_changing=false >> >> mmap_read_lock() >> !mmap_changing [ok] >> mfill_atomic_pte() == 0 [page1 copied] >> mmap_read_unlock() >> >> return: 2 pages were mapped, while the first is present in the child and >> the second one is non-present. >> >> Bottom-line: it seems to me that mmap_changing should be a counter (not >> boolean) that is protected by mmap_lock. This counter should be kept >> elevated throughout the entire operation (in regard to MADV_DONTNEED). >> Perhaps mmap_lock does not have to be taken to decrease the counter, but >> then an smp_wmb() would be needed before the counter is decreased. >> >> Let me know whether I am completely off or missing something. > > I tried to remember what's going on there and wrap my head around your > examples. I'm not sure if userspace cannot workaround some of those, but > I can't say I can propose it right now. > > There is for sure userspace is helpless in Scenario 4, but I think it is > very unlikely that fork() will be fast enough to grab and release > mmap_lock while uffd_copy() waits for CPU to retry. > > I agree that a making mmap_changing a counter would be more robust > anyway. Thanks for confirming my suspicion. On a second thought, I think that a sequence lock would be required. I will work on a patch to resolve it in the next RFC of the related patch series I am working on. As for the race window size, as there are lock optimizations to prevent writers' starvation, I do not think the last scenario is completely far-fetched. Thanks again, Nadav