Re: [PATCH 1/3] userfaultfd/selftests: fix feature support detection

Axel Rasmussen <axelrasmussen@xxxxxxxxxx> · Wed, 22 Sep 2021 15:29:42 -0700

On Wed, Sep 22, 2021 at 2:52 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> > On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > Hello, Axel,
> > >
> > > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > > Thanks for discussing the design Peter. I have some ideas which might
> > > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > > come up with.
> > >
> > > Sure thing.  Note again that since I don't have a strong opinion on that, feel
> > > free to keep it.  However, if you do send a v2, I'll read it.
> > >
> > > [off-topic below]
> > >
> > > Another thing I have probably forgotten and need you to confirm: when you
> > > worked on uffd minor mode, did you explicitly disable THP, or is it allowed?
> >
> > I gave a more detailed answer in the other thread, but: currently it
> > is allowed, but this was a bug / oversight on my part. :) THP collapse
> > can break the guarantees minor fault registration is trying to
> > provide.
>
> I've replied there:
>
> https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
>
> We can try to keep the discussion unified there regarding this.
>
> > But there's another scenario: what if the collapse happened well
> > before registration happened?
>
> Maybe yes, but my understanding of the current uffd-minor scenario tells me
> that this is fine too.  Meanwhile I actually have another idea regarding minor
> mode, please continue reading.
>
> Firstly, let me try to recap how minor mode is used in your production
> systems: I believe there should be two processes, A and B; if A is the main
> process, B could be the migration process.  B migrates pages in the background,
> while A so far should have been stopped and never run.  When we want to start
> A, we should register A with uffd-minor upon the whole range (note: I think so
> far A does not have any pgtable mapped within the uffd-minor range).  Then any
> page access of A should kick B, asking "whether it is the latest page": if yes
> then UFFDIO_CONTINUE; if no then B modifies the page, plus UFFDIO_CONTINUE
> afterwards.  Am I right above?
>
> So if that's the case, then A should have no page table at all.
>
> Then, is it a problem if the shmem file that A maps contains THPs?  I
> think not, because UFFDIO_CONTINUE will only install small pages.
>
> Let me know if I'm understanding it right above; I'll be happy to be corrected.

Right, except that our use case is even more similar to QEMU: the code
doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs,
are in the same process (same mm) - just different threads.
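
To make the flow concrete, the per-fault step on the handler thread looks
roughly like the sketch below. This is a minimal illustration, not our
production code: the helper names (page_align_down, resolve_minor_fault) are
made up for this example, and it assumes the handler has already brought the
page contents up to date through its own, non-registered mapping of the same
shmem file. It also assumes kernel headers new enough to define
UFFDIO_CONTINUE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round a faulting address down to the start of its page.
 * page_size must be a power of two. */
static uint64_t page_align_down(uint64_t addr, uint64_t page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Resolve one minor fault (hypothetical helper). The page is already in
 * the page cache, so we only ask the kernel to install the PTE and wake
 * the faulting thread.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int resolve_minor_fault(int uffd, uint64_t fault_addr,
			       uint64_t page_size)
{
	struct uffdio_continue cont = {
		.range = {
			.start = page_align_down(fault_addr, page_size),
			.len = page_size,
		},
		.mode = 0, /* 0 = wake the waiting thread after mapping */
	};

	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}
```

Setting UFFDIO_CONTINUE_MODE_DONTWAKE in mode instead would install the
mapping without waking the faulting thread, in which case a separate
UFFDIO_WAKE is needed later.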

>
> Actually besides this scenario, I'm also thinking of another scenario of using
> minor fault in a single process - that's mostly what QEMU is doing right now,
> as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
> So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
> all the file-backed memory pgtables of a specific range.  I think it'll suit
> the minor fault use case perfectly, and it can be used for other things
> too.  Let me know what you think about this idea, and whether that'll help in
> your case too (e.g., if you worry that the current process A has mapped a huge
> shmem THP somewhere, we can use madvise(MADV_ZAP) to drop it).

Yes, this would be convenient for our implementation too. :) There are
workarounds if the feature doesn't exist, but it would be nice to
have. It's also useful for memory poisoning, I think, if the host
decides some page(s) are "bad" and wants to intercept any future guest
accesses to those page(s).
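
As an example of the kind of workaround I mean (an assumption chosen for
illustration, not necessarily what we do): mapping the same file range again
with MAP_FIXED replaces the old mapping, so its page table entries go away
while the page cache contents stay put. zap_by_remap below is a hypothetical
helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Drop the page tables for [addr, addr + len) by mapping the same
 * file range over it again (hypothetical helper). The page cache is
 * untouched, so the data survives; only the PTEs are torn down.
 * Returns addr on success, MAP_FAILED on error.
 */
static void *zap_by_remap(void *addr, size_t len, int fd, off_t offset)
{
	return mmap(addr, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_FIXED, fd, offset);
}
```

The catch is that the new mapping is a fresh VMA, so any userfaultfd
registration on the old range is gone and has to be redone afterwards. A
dedicated madvise() mode would avoid that.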

>
> > I *think* the existing code deals with THPs correctly in that case, but then
> > again I don't think our selftest really covers this case, and it's not
> > something I've tested in production either (to work around the other bug, we
> > currently MADV_NOHUGEPAGE the area until after VM demand paging completes,
> > and the UFFD registration is removed), so I am not super confident this is
> > the case.
>
> In any case, enhancements to the test program are always welcome.
>
> Thanks,
>
> --
> Peter Xu
>