On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > > Another use case of DONTNEED upon hugetlbfs could be uffd-minor,
> > > because afaiu this is the only API that can force strip the hugetlb
> > > mapped pgtable without losing pagecache data.
> >
> > Correct. However, I do not know if uffd-minor users would ever want
> > to do this. Perhaps?

I talked with some colleagues, and I didn't come up with any production
*requirement* for it, but it may be a convenience in some cases (it makes
certain code cleaner, e.g. not having to unmap-and-remap to tear down
page tables, as Peter mentioned). I think Peter's assessment below is
right.

> My understanding is that before this patch, uffd-minor upon hugetlbfs
> required the huge file to be mapped twice: one mapping to populate the
> content, so that we could then trap MINOR faults via the other
> mapping. Or we could munmap() the range and remap it again at the same
> file offset to drop the pgtables, I think. But that sounds tricky.
> MINOR faults only work with the pgtables dropped.
>
> With DONTNEED upon hugetlbfs we can rely on one single mapping of the
> file, because we can explicitly drop the pgtables of hugetlbfs files
> without any other tricks.
>
> However, I have no real use case for it. Initially I thought it could
> be useful for QEMU, because QEMU's migration routine runs in the same
> mm context as the hypervisor, so by default it doesn't have two
> mappings of the same guest memory. If QEMU wants to leverage minor
> faults, DONTNEED could help.
>
> However, when I was measuring bitmap transfer (assuming that's what
> minor faults could help with in QEMU's postcopy) some months ago, I
> found it's not as slow as I thought at all.. Either I could have
> missed something, or we're facing different problems from when uffd
> minor was first proposed by Axel.

Re: the bitmap, that matters most on machines with lots of RAM.
For example, GCE offers some VMs with up to 12 *TB* of RAM
(https://cloud.google.com/compute/docs/memory-optimized-machines). With
machines of that size, I think we see a significant benefit, as it can
take significant time for the bitmap to arrive over the network. But
that's a bit of an edge case; most machines are not that big. :)

I think the benefit is more often seen just in avoiding copies. E.g. if
we find a page is already up-to-date after precopy, we just install
PTEs; no copying or page allocation is needed. And even when we have to
go fetch a page over the network, one can imagine an RDMA setup where we
avoid any copies/allocations at all even in that case. I suppose this
also has a bigger effect on larger machines, e.g. ones backed by 1G
pages instead of 4k.

> This is probably too off-topic, though.. Let me go back..
>
> That said, one thing I'm not sure about with DONTNEED on hugetlb is
> whether this could further abuse DONTNEED, as the original POSIX
> definition is as simple as:
>
>   The application expects that it will not access the specified
>   address range in the near future.
>
> Linux implemented it by tearing down pgtables, which looks okay so
> far. It could be a bit more weird to apply it to hugetlbfs, because by
> its definition it's a hint to page reclaim; however, hugetlbfs is not
> a target of page reclaim, nor is it LRU-aware. It goes further, into
> some MADV_ZAP-styled syscall.
>
> I think it could still be fine, as POSIX doesn't define that behavior
> specifically for hugetlb, so it can be defined by Linux; but I'm not
> sure whether there can be other implications.
>
> Thanks,
>
> --
> Peter Xu