Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings

Axel Rasmussen <axelrasmussen@xxxxxxxxxx> · Fri, 11 Feb 2022 11:08:14 -0800

On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > > Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> > > this is the only api that can force strip the hugetlb mapped pgtable without
> > > losing pagecache data.
> >
> > Correct.  However, I do not know if uffd-minor users would ever want to
> > do this.  Perhaps?

I talked with some colleagues, and I didn't come up with any
production *requirement* for it, but it may be a convenience in some
cases (make certain code cleaner, e.g. not having to unmap-and-remap
to tear down page tables as Peter mentioned). I think Peter's
assessment below is right.
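
To make the "tear down page tables without losing pagecache data"
behavior concrete, here is a rough sketch of the semantics on a shared
file-backed mapping. Assumptions: it uses an ordinary temp file under
/tmp rather than hugetlbfs, so it runs without reserved huge pages, and
the helper name dontneed_keeps_pagecache() is made up for illustration.
The idea is that, with this series, the same madvise() call would be
accepted on a hugetlb mapping.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for madvise() and mkstemp() under strict -std= modes */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch series).
 * Returns 1 if data written through a MAP_SHARED mapping is still
 * visible after MADV_DONTNEED, 0 if it is not, -1 on setup failure.
 */
int dontneed_keeps_pagecache(void)
{
    char path[] = "/tmp/dontneed-demo-XXXXXX"; /* assumed writable */
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)page * 4;
    char *map;
    int result = -1;
    int fd;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives on only through fd */

    if (ftruncate(fd, (off_t)len) != 0)
        goto out_close;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    /* Populate the page cache through the mapping. */
    memset(map, 0xab, len);

    /* Drop this mapping's page table entries; the page cache stays. */
    if (madvise(map, len, MADV_DONTNEED) != 0)
        goto out_unmap;

    /* Touching the range faults the same pages back in. */
    result = 1;
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)map[i] != 0xab) {
            result = 0;
            break;
        }
    }

out_unmap:
    munmap(map, len);
out_close:
    close(fd);
    return result;
}
```

For a uffd-minor user, the faults taken after the madvise() call are
the ones that would be trapped as MINOR faults once the range is
registered, without a second mapping or an unmap-and-remap.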

>
> My understanding is that before this patch, uffd-minor on hugetlbfs requires
> the huge file to be mapped twice: one mapping to populate the content, and the
> other to trap MINOR faults.  Or we could munmap() the range and remap it again
> at the same file offset to drop the pgtables, I think. But that sounds tricky.
> MINOR faults only work with pgtables dropped.
>
> With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
> because we can explicitly drop the pgtables of hugetlbfs files without any
> other tricks.
>
> However, I have no real use case for it.  Initially I thought it could be
> useful for QEMU, because the QEMU migration routine runs in the same mm context
> as the hypervisor, so by default it doesn't have two mappings of the same guest
> memory.  If QEMU wants to leverage minor faults, DONTNEED could help.
>
> However, when I measured bitmap transfer some months ago (assuming that's what
> minor faults could help with in QEMU's postcopy), I found it's not as slow as I
> thought at all.  Either I missed something, or we're facing different problems
> from those at the time uffd minor was first proposed by Axel.

Re: the bitmap, that matters most on machines with lots of RAM. For
example, GCE offers some VMs with up to 12 *TB* of RAM
(https://cloud.google.com/compute/docs/memory-optimized-machines). I
think with this size of machine we see a significant benefit, as it
may take a significant amount of time for the bitmap to arrive over
the network.

But I think that's a bit of an edge case, most machines are not that
big. :) I think the benefit is more often seen just in avoiding
copies. E.g. if we find a page is already up-to-date after precopy, we
just install PTEs, no copying or page allocation needed. And even when
we have to go fetch a page over the network, one can imagine an RDMA
setup where we can avoid any copies/allocations at all even in that
case. I suppose this also has a bigger effect on larger machines, e.g.
ones that are backed by 1G pages instead of 4k.

>
> This is probably too off topic, though..  Let me go back..
>
> That said, one thing I'm not sure about with DONTNEED on hugetlb is whether
> this could further abuse DONTNEED, as the original POSIX definition is as
> simple as:
>
>   The application expects that it will not access the specified address range
>   in the near future.
>
> Linux did it by tearing down pgtables, which looks okay so far.  It could be a
> bit more weird to apply it to hugetlbfs, because by its definition it's a hint
> to page reclaim; however, hugetlbfs is not a target of page reclaim, nor is it
> LRU-aware.  It goes further toward some MADV_ZAP-style call.
>
> I think it could still be fine as posix doesn't define that behavior
> specifically on hugetlb so it can be defined by Linux, but not sure whether
> there can be other implications.
>
> Thanks,
>
> --
> Peter Xu
>