On 06/08/23 11:50, Yang Shi wrote:
> On Wed, Jun 7, 2023 at 11:34 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 08.06.23 02:02, David Rientjes wrote:
> > > On Wed, 7 Jun 2023, Mike Kravetz wrote:
> > >
> > >>>>>> Are there strong objections to extending hugetlb for this support?
> > >>>>>
> > >>>>> I don't want to get too involved in this discussion (busy), but I
> > >>>>> absolutely agree on the points that were raised at LSF/MM that
> > >>>>>
> > >>>>> (A) hugetlb is complicated and very special (many things not integrated
> > >>>>> with core-mm, so we need special-casing all over the place). [example:
> > >>>>> what is a pte?]
> > >>>>>
> > >>>>> (B) We added a bunch of complexity in the past that some people
> > >>>>> considered very important (and it was not feature frozen, right? ;) ).
> > >>>>> Looking back, we might just not have done some of that, or done it
> > >>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
> > >>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > >>>>> because it fails with NUMA/fork, ...)
> > >>>>>
> > >>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
> > >>>>> out of reach, maybe even impossible with all the complexity we added
> > >>>>> over the years (well, and keep adding).
> > >>>>>
> > >>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > >>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > >>>>> old. So we managed to get quite far without that optimization.
> > >>>>>
> >
> > > Sane handling for memory poisoning and optimizations for live migration
> > > are both much more important for the real-world 1GB hugetlb user, so it
> > > doesn't quite have that lengthy of a history.
> > >
> > > Unfortunately, cloud providers receive complaints about both of these from
> > > customers. They are one of the most significant causes for poor customer
> > > experience.
> > >
> > > While people have proposed 1GB THP support in the past, it was nacked, in
> > > part, because of the suggestion to just use existing 1GB support in
> > > hugetlb instead :)
> Yes, but it was before HGM was proposed, we may revisit it.
> Adding Zi Yan on CC as the person driving 1G THP.
> >
> > Yes, because I still think that the use for "transparent" (for the user)
> > nowadays is very limited and not worth the complexity.
> >
> > IMHO, what you really want is a pool of large pages that (guarantees
> > about availability and nodes) and fine control about who gets these
> > pages. That's what hugetlb provides.
> The most concern for 1G THP is the allocation time. But I don't think
> it is a no-go for allocating THP from a preallocated pool, for
> example, CMA.

I seem to remember Zi trying to use CMA for 1G THP allocations. However,
I am not sure if using CMA would be sufficient. IIUC, allocating from CMA
could still require page migrations to put together a 1G contiguous area.
In a pool as used by hugetlb, 1G pages are pre-allocated and sitting in
the pool. The downside of such a pool is that the memory cannot be used
for other purposes and sits 'idle' if not allocated.

Hate to even bring this up, but there are complaints today about
'allocation time' of 1GB pages from the hugetlb pool. This 'allocation
time' is actually the time it takes to clear/zero 1G of memory. The only
reason I mention this is that using something like CMA to allocate 1G
pages (at fault time) may add unacceptable latency.
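A minimal sketch (not from the original discussion) of where that time
goes, assuming the pool was pre-populated beforehand (e.g. by writing to
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages) on hardware
with 1G page support: the mmap() below only takes a page out of the pool,
and the expensive part, clearing the full 1G, happens on first touch.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30U << MAP_HUGE_SHIFT)	/* huge page size encoded as log2 */
#endif

int main(void)
{
	size_t len = 1UL << 30;		/* one 1G huge page */
	volatile char *p;

	/* Cheap: only reserves a page from the pre-allocated pool. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		 -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");		/* pool empty or 1G pages unsupported */
		return 1;
	}

	/* Expensive: this single fault makes the kernel clear the full 1G. */
	p[0] = 1;

	munmap((void *)p, len);
	return 0;
}

With something like CMA backing the allocation instead, that fault would
presumably also pay for migrating/compacting whatever currently occupies
the range, on top of the clear itself.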
> >
> > In contrast to THP, you don't want to allow for
> > * Partially mmap, mremap, munmap, mprotect them
> > * Partially sharing them / COW'ing them
> > * Partially mixing them with other anon pages (MADV_DONTNEED + refault)
> IIRC, QEMU treats hugetlbfs as 2M block size, we should be able to
> teach QEMU to treat tmpfs + THP as 2M block size too. I used to have a
> patch to make stat.st_blksize return THP size for tmpfs (89fdcd262fd4
> mm: shmem: make stat.st_blksize return huge page size if THP is on).
> So when the applications are aware of the 2M or 1G page/block size,
> hopefully it may help reduce the partial mapping things. But I'm not
> an expert on QEMU, I may miss something.
> > * Exclude them from some features KSM/swap
> > * (swap them out and eventually split them for that)
> We have "noswap" mount option for tmpfs now, so swap is not a problem.
>
> But we may lose some features, for example, PMD sharing, hugetlb
> cgroup, etc. Not sure whether they are a showstopper or not.
>
> So it sounds easier to have 1G THP than HGM IMHO if I don't miss
> something vital.

I have always wanted to experiment with having THP use a pre-allocated
pool for huge page allocations. Of course, this adds the complication of
what to do when the pool is exhausted.

Perhaps Zi has performed such experiments?
--
Mike Kravetz
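As an aside, the stat.st_blksize behavior Yang refers to (commit
89fdcd262fd4) is easy to observe; a minimal sketch, assuming the file
passed in lives on a tmpfs mount with huge pages enabled (e.g.
huge=always):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file-on-tmpfs>\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &st)) {
		perror("stat");
		return 1;
	}

	/* Expected to show the huge page size (e.g. 2M) when shmem THP
	 * applies to this file, the base page size otherwise. */
	printf("%s: st_blksize = %ld\n", argv[1], (long)st.st_blksize);
	return 0;
}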