Re: [PATCH v2 2/3] userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx

Lokesh Gidra <lokeshgidra@xxxxxxxxxx> · Wed, 7 Feb 2024 12:24:52 -0800

On Wed, Feb 7, 2024 at 7:27 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> On Mon, Feb 05, 2024 at 12:53:33PM -0800, Lokesh Gidra wrote:
> > On Sun, Feb 4, 2024 at 2:27 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > >
> > > > 3) Based on [1] I see how mmap_changing helps in eliminating duplicate
> > > > work (background copy) by uffd monitor, but didn't get if there is a
> > > > correctness aspect too that I'm missing? I concur with Amit's point in
> > > > [1] that getting -EEXIST when setting up the pte will avoid memory
> > > > corruption, no?
> > >
> > > In the fork case without mmap_changing the child process may be get data or
> > > zeroes depending on the race for mmap_lock between the fork and
> > > uffdio_copy and -EEXIST is not enough for monitor to detect what was the
> > > ordering between fork and uffdio_copy.
> >
> > This is extremely helpful. IIUC, there is a window after mmap_lock
> > (write-mode) is released and before the uffd monitor thread is
> > notified of fork. In that window, the monitor doesn't know that fork
> > has already happened. So, without mmap_changing it would have done
> > background copy only in the parent, thereby causing data inconsistency
> > between parent and child processes.
>
> Yes.
>
> > It seems to me that the correctness argument for mmap_changing is
> > there in case of FORK event and REMAP when mremap is called with
> > MREMAP_DONTUNMAP. In all other cases its only benefit is by avoiding
> > unnecessary background copies, right?
>
> Yes, I think you are right, but it's possible I've forgot some nasty race
> that will need mmap_changing for other events.
>
> > > > > > > > > > @@ -783,7 +788,9 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
> > > > > > > > > >               return true;
> > > > > > > > > >
> > > > > > > > > >       userfaultfd_ctx_get(ctx);
> > > > > > > > > > +     down_write(&ctx->map_changing_lock);
> > > > > > > > > >       atomic_inc(&ctx->mmap_changing);
> > > > > > > > > > +     up_write(&ctx->map_changing_lock);
> > > > > > > > > >       mmap_read_unlock(mm);
> > > > > > > > > >
> > > > > > > > > >       msg_init(&ewq.msg);
> > > > > > >
> > > > > > > If this happens in read mode, then why are you waiting for the readers
> > > > > > > to leave?  Can't you just increment the atomic?  It's fine happening in
> > > > > > > read mode today, so it should be fine with this new rwsem.
> > > > > >
> > > > > > It's been a while and the details are blurred now, but if I remember
> > > > > > correctly, having this in read mode forced non-cooperative uffd monitor to
> > > > > > be single threaded. If a monitor runs, say uffdio_copy, and in parallel a
> > > > > > thread in the monitored process does MADV_DONTNEED, the latter will wait
> > > > > > for userfaultfd_remove notification to be processed in the monitor and drop
> > > > > > the VMA contents only afterwards. If a non-cooperative monitor would
> > > > > > process notification in parallel with uffdio ops, MADV_DONTNEED could
> > > > > > continue and race with uffdio_copy, so read mode wouldn't be enough.
> > > > > >
> > > > >
> > > > > Right now this function won't stop to wait for readers to exit the
> > > > > critical section, but with this change there will be a pause (since the
> > > > > down_write() will need to wait for the readers with the read lock).  So
> > > > > this is adding a delay in this call path that isn't necessary (?) nor
> > > > > existed before.  If you have non-cooperative uffd monitors, then you
> > > > > will have to wait for them to finish to mark the uffd as being removed,
> > > > > where as before it was a fire & forget, this is now a wait to tell.
> > > > >
> > > > I think a lot will be clearer once we get a response to my questions
> > > > above. IMHO not only this write-lock is needed here, we need to fix
> > > > userfaultfd_remove() by splitting it into userfaultfd_remove_prep()
> > > > and userfaultfd_remove_complete() (like all other non-cooperative
> > > > operations) as well. This patch enables us to do that as we remove
> > > > mmap_changing's dependency on mmap_lock for synchronization.
> > >
> > > The write-lock is not a requirement here for correctness and I don't see
> > > why we would need userfaultfd_remove_prep().
> > >
> > > As I've said earlier, having a write-lock here will let CRIU to run
> > > background copy in parallel with processing of uffd events, but I don't
> > > feel strongly about doing it.
> > >
> > Got it. Anyways, such a change needn't be part of this patch, so I'm
> > going to keep it unchanged.
>
> You mean with a read lock?

No, I think write lock is good as it enables parallel background copy.
Also because it brings consistency in blocking userfaultfd operations.

I meant encapsulating remove operations within
userfaultfd_remove_prep() and userfaultfd_remove_complete(). I
couldn't figure out any need for that.

>
> --
> Sincerely yours,
> Mike.