On 2/11/22 11:08, Axel Rasmussen wrote:
> On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>>
>> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
>>>> Another use case of DONTNEED upon hugetlbfs could be uffd-minor,
>>>> because afaiu this is the only api that can force strip the hugetlb
>>>> mapped pgtable without losing pagecache data.
>>>
>>> Correct.  However, I do not know if uffd-minor users would ever want
>>> to do this.  Perhaps?
>
> I talked with some colleagues, and I didn't come up with any production
> *requirement* for it, but it may be a convenience in some cases (making
> certain code cleaner, e.g. not having to unmap-and-remap to tear down
> page tables as Peter mentioned).  I think Peter's assessment below is
> right.
>
>>
>> My understanding is that before this patch, uffd-minor upon hugetlbfs
>> requires the huge file to be mapped twice: one mapping to populate the
>> content, then we'll be able to trap MINOR faults via the other mapping.
>> Or we could munmap() the range and remap it again at the same file
>> offset to drop the pgtables, I think.  But that sounds tricky.  MINOR
>> faults only work with pgtables dropped.
>>
>> With DONTNEED upon hugetlbfs we can rely on one single mapping of the
>> file, because we can explicitly drop the pgtables of hugetlbfs files
>> without any other tricks.
>>
>> However, I have no real use case for it.  Initially I thought it could
>> be useful for QEMU, because QEMU's migration routine runs in the same
>> mm context as the hypervisor, so by default it doesn't have two
>> mappings of the same guest memory.  If QEMU wants to leverage minor
>> faults, DONTNEED could help.
>>
>> However, when I was measuring bitmap transfer (assuming that's what
>> minor faults could help with in qemu's postcopy) some months ago, I
>> found it's not as slow as I thought at all.  Either I could have missed
>> something, or we're facing different problems from when uffd minor was
>> first proposed by Axel.
>
> Re: the bitmap, that matters most on machines with lots of RAM.  For
> example, GCE offers some VMs with up to 12 *TB* of RAM
> (https://cloud.google.com/compute/docs/memory-optimized-machines); with
> machines of that size I think we see a significant benefit, as it may
> take some significant time for the bitmap to arrive over the network.
>
> But I think that's a bit of an edge case, most machines are not that
> big. :)  I think the benefit is more often seen just in avoiding copies.
> E.g. if we find a page is already up-to-date after precopy, we just
> install PTEs; no copying or page allocation needed.  And even when we
> have to go fetch a page over the network, one can imagine an RDMA setup
> where we can avoid any copies/allocations at all even in that case.  I
> suppose this also has a bigger effect on larger machines, e.g. ones that
> are backed by 1G pages instead of 4k.
>

Thanks Peter and Axel!

As mentioned, my primary motivation was simply to clean up the userfaultfd
selftest.  Glad to see there might be more use cases.  If we can simplify
other code as in the case of the userfaultfd selftest, that would be a win.
-- 
Mike Kravetz