On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 08.06.23 02:02, David Rientjes wrote:
> > On Wed, 7 Jun 2023, Mike Kravetz wrote:
> >
> >>>>>> Are there strong objections to extending hugetlb for this support?
> >>>>>
> >>>>> I don't want to get too involved in this discussion (busy), but I
> >>>>> absolutely agree on the points that were raised at LSF/MM that
> >>>>>
> >>>>> (A) hugetlb is complicated and very special (many things not integrated
> >>>>> with core-mm, so we need special-casing all over the place). [example:
> >>>>> what is a pte?]
> >>>>>
> >>>>> (B) We added a bunch of complexity in the past that some people
> >>>>> considered very important (and it was not feature frozen, right? ;) ).
> >>>>> Looking back, we might just not have done some of that, or done it
> >>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
> >>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> >>>>> because it fails with NUMA/fork, ...)
> >>>>>
> >>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
> >>>>> out of reach, maybe even impossible with all the complexity we added
> >>>>> over the years (well, and keep adding).
> >>>>>
> >>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> >>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> >>>>> old. So we managed to get quite far without that optimization.
> >>>>>
> >
> > Sane handling for memory poisoning and optimizations for live migration
> > are both much more important for the real-world 1GB hugetlb user, so it
> > doesn't quite have that lengthy of a history.
> >
> > Unfortunately, cloud providers receive complaints about both of these from
> > customers. They are one of the most significant causes for poor customer
> > experience.
> >
> > While people have proposed 1GB THP support in the past, it was nacked, in
> > part, because of the suggestion to just use existing 1GB support in
> > hugetlb instead :)

Yes, but that was before HGM was proposed; we may revisit it.

>
> Yes, because I still think that the use for "transparent" (for the user)
> nowadays is very limited and not worth the complexity.
>
> IMHO, what you really want is a pool of large pages that (guarantees
> about availability and nodes) and fine control about who gets these
> pages. That's what hugetlb provides.

The biggest concern for 1G THP is the allocation time. But I don't think
that is a no-go if the THP can be allocated from a preallocated pool, for
example, CMA.

>
> In contrast to THP, you don't want to allow for
> * Partially mmap, mremap, munmap, mprotect them
> * Partially sharing them / COW'ing them
> * Partially mixing them with other anon pages (MADV_DONTNEED + refault)

IIRC, QEMU treats hugetlbfs as having a 2M block size; we should be able
to teach QEMU to treat tmpfs + THP as having a 2M block size too. I used
to have a patch to make stat.st_blksize return the THP size for tmpfs
(89fdcd262fd4 mm: shmem: make stat.st_blksize return huge page size if
THP is on).

So once applications are aware of the 2M or 1G page/block size, hopefully
that helps reduce the partial-mapping cases (see the rough userspace
sketch further below).

But I'm not an expert on QEMU, so I may be missing something.

> * Exclude them from some features KSM/swap
> * (swap them out and eventually split them for that)

We have the "noswap" mount option for tmpfs now, so swap is not a
problem. But we may lose some features, for example, PMD sharing, the
hugetlb cgroup, etc. Not sure whether those are a showstopper or not.
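To illustrate the st_blksize idea, here is a rough userspace sketch (not
QEMU code; the path and sizes are made up): derive the mapping granularity
from the backing file's st_blksize, so the same code works on hugetlbfs
today and on tmpfs once st_blksize reflects the THP size:

/*
 * Rough sketch only (not QEMU code): derive the mapping granularity from
 * the backing file's st_blksize. On hugetlbfs st_blksize is already the
 * huge page size; on tmpfs it would be the THP size with the change
 * discussed above, and just 4K otherwise.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/dev/shm/guest-ram"; /* made-up default */
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return 1;
	}

	size_t block = st.st_blksize;	/* 2M/1G on hugetlbfs, ideally also on tmpfs+THP */
	size_t want  = 256UL << 20;	/* e.g. 256 MiB of guest RAM */
	size_t len   = (want + block - 1) & ~(block - 1); /* round up to the block size */

	if (ftruncate(fd, len) < 0) {
		perror("ftruncate");
		return 1;
	}

	void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("block size %zu, mapped %zu bytes at %p\n", block, len, ram);
	munmap(ram, len);
	close(fd);
	return 0;
}

As long as the application only maps/unmaps/protects at multiples of that
block size, it never creates partial mappings of a huge page by
construction.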
So IMHO it sounds easier to have 1G THP than HGM, if I'm not missing
something vital.

>
> Because you don't want to get these pages PTE-mapped by the system
> *unless* there is a real reason (HGM, hwpoison) -- you want guarantees.
> Once such a page is PTE-mapped, you only want to collapse in place.
>
> But you don't want special-HGM, you simply want the core to PTE-map them
> like a (file) THP.
>
> IMHO, getting that realized much easier would be if we wouldn't have to
> care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD
> sharing), but maybe there is a way ...
>
> >
> >>>>> Absolutely, HGM for better postcopy live migration also makes sense, I
> >>>>> guess nobody disagrees on that.
> >>>>>
> >>>>>
> >>>>> But as discussed in that session, maybe we should just start anew and
> >>>>> implement something that integrates nicely with the core, instead of
> >>>>> making hugetlb more complicated and even more special.
> >>>>>
> >
> > Certainly an ideal would be where we could support everybody's use cases
> > in a much more cohesive way with the rest of the core MM. I'm
> > particularly concerned about how long it will take to get to that state
> > even if we had kernel developers committed to doing the work. Even if we
> > had a design for this new subsystem that was more tightly coupled with the
> > core MM, it would take O(years) to implement, test, extend for other
> > architectures, and that's before any existing users of hugetlb could
> > make the changes in the rest of their software stack to support it.
>
> One interesting experiment would be to just take hugetlb and remove all
> complexity (strip it to its core: a pooling of large pages without
> special MAP_PRIVATE support, PMD sharing, reservations, ...). Then, see
> how to get core-mm to just treat them like PUD/PMD-mapped folios that
> can get PTE-mapped -- just like we have with FS-level THP.
>
> Maybe we could then factor out what's shared with the old hugetlb
> implementations (e.g., pooling) and have both co-exist (e.g., configured
> at runtime).
>
> The user-space interface for hugetlb would not change (well, except fail
> MAP_PRIVATE for now)
>
> (especially, no messing with anon hugetlb pages)
>
>
> Again, the spirit would be "teach the core to just treat them like
> folios that can get PTE-mapped" instead of "add HGM to hugetlb". If we
> can achieve that without a hugetlb v2, great. But I think that will be
> harder ... but I might be just wrong.
>
> --
> Cheers,
>
> David / dhildenb
>
>