On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 15.06.23 10:04, Michal Hocko wrote:
> > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> >> On 06/12/23 18:59, David Rientjes wrote:
> >>> This week's topic will be a technical brainstorming session on HugeTLB
> >>> convergence with the core MM. This has been discussed most recently in
> >>> this thread:
> >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@xxxxxxxxxxxxxxxxxxxx/T/
> >>
> >> Thank you David for putting this session together! And, thanks to everyone
> >> who participated.
> >>
> >> Following up on linux-mm with most active participants on Cc (sorry if I
> >> missed someone). If it makes more sense to continue the above thread,
> >> please move there.
> >>
> >> Even though everyone knows that hugetlb is special cased throughout the
> >> core mm, it came to a head with the proposed introduction of HGM. TBH,
> >> few people in the core mm community paid much attention to HGM when first
> >> introduced. A LSF/MM session was then dedicated to the discussion of
> >> HGM with the outcome being the suggestion to create a new filesystem/driver
> >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> >> One thing that was not emphasized at LSF/MM is that there are existing
> >> hugetlb users experiencing major issues that could be addressed with HGM:
> >> specifically the issues of memory errors and live migration. That was
> >> the starting point for recent discussion in the above thread.
> >>
> >> I may be wrong, but it appeared the direction of that thread was to
> >> first try and unify some of the hugetlb and core mm code. Eliminate
> >> some of the special casing. If hugetlb was less of a special case, then
> >> perhaps HGM would be more acceptable. That is the impression I (perhaps
> >> incorrectly) had going into today's session.
> >
> > My impression from the discussion yesterday was that the level of
> > unification would need to be really large and time consuming in order to
> > be useful for the HGM patchset to be in a more maintainable form. The
> > final outcome is quite hard to predict at this stage.
> >
> >> During today's session, we often discussed what would/could be introduced
> >> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> >> However, people also made the comparisons to cgroup v1 - v2. Such a
> >> redesign provides the needed 'clean slate' to do things right, but it
> >> does little for existing users who would be unwilling to quickly move off
> >> existing hugetlb.
> >>
> >> We did spend a good chunk of time on hugetlb/core mm unification and
> >> removing special casing. In some (most) of these cases, the benefit of
> >> removing special cases from core mm would result in adding more code to
> >> hugetlb. For example: proper type'ing so that hugetlb does not treat
> >> all page table entries as PTEs. Again, I may be wrong but I think
> >> people were OK with adding more code (and even complexity) to hugetlb
> >> if it eliminated special casing in the core mm. But, there did not
> >> seem to be a clear consensus especially with the thought that we may
> >> need to double hugetlb code to get types right.
> >
> > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > over complicated in its current form already. Regressions are not really
> > seldom when code is added which is a signal we are hitting maintenance
> > cost walls. This doesn't mean further development is impossible of
> > course but it is increasingly more costly AFAICS.
> >
> >> Unless I missed something, there was no clear direction at the end of this
> >> session. I was hoping that we could come up with a plan to address the
> >> issues facing today's hugetlb users.
> >> IMO, there seem to be two options:
> >> 1) Start work on hugetlb v2 with the intention that customers will need
> >> to move to this to address their issues.
> >> 2) Incorporate functionality like HGM into existing hugetlb.
> >
>
> I fully agree with all that Michal said.
>
> I'm just going to add that I don't see why anyone would look into a
> hugetlbv2 if we're going to use the motivation of "help existing users"
> to make hugetlb ever-more complicated and special. "existing users" here
> even meaning "people use hugetlb for backing VMs. Now they want to get
> postcopy working with less latency." -- which I consider partially a new
> use case.
>
> So working on adding HGM and concurrently starting a hugetlbv2? I don't
> think that will happen if we decide on adding HGM and proceeding with
> that reasoning about existing users.
>
> As expressed yesterday, I don't see a fast and clean way to make hugetlb
> significantly less special (thanks Willy for the list of odd cases).
>
> Sure, we can talk about adding pte_t safety, but I don't really see a
> way forward to unify page table walking code that way -- there are still
> the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> anybody wants to work on that, why not.
>
> Having that said, like Michal, I acknowledge that it is Mike's call
> regarding the hugetlb code. I, for my part, will push back on any added
> core-mm complexity that adds more special casing for hugetlb. Maybe
> there are easy ways to integrate it nicely and that is not really a concern.

HGM is mostly contained in the already-existing HugeTLB special cases.
HGM doesn't really *add* special cases, it just makes the HugeTLB
special cases more complicated.

There are a few small ways that HGM touches non-hugetlb code:
1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here [2]
2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]
3. A small non-hugetlb change to page_vma_mapped_walk (provide pte_order) [5]
4. A small special case in try_to_unmap_one and try_to_migrate_one (to
check the head page for page flags) [6]
5. smaps stats [7]

[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@xxxxxxxxxx/
[2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@xxxxxxxxxx/
[3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@xxxxxxxxxx/
[4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@xxxxxxxxxx/
[5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@xxxxxxxxxx/
[6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@xxxxxxxxxx/
[7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@xxxxxxxxxx/

>
> Note that while we've been discussing how HGM would already interfere
> with core-mm, we've not even started discussing how actual
> MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
> require special-casing for hugetlb.
>
> I, for my part, will explore a bit the mapcount topic (as time permits)
> and see if we can come up at least with a unified mapcount approach
> (e.g., sub-page mapcount?). But I suspect even figuring that out will
> take quite a while already ...

Thanks!

Simply using the current THP mapcount scheme with HGM isn't great (but
IIUC this isn't blocking HGM). By using this scheme, HugeTLB loses the
vmemmap optimization / page struct freeing when HGM is in use, and, of
course, this scheme gets slow with very large folios.
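
For a rough sense of scale behind that last point, here is a
back-of-the-envelope sketch. It assumes 4 KiB base pages and a 64-byte
struct page (both are assumptions, not stated in this thread), and that
the cost of a per-subpage mapcount scheme grows with the number of base
pages in the folio. The helper names are made up for illustration.

```python
# Assumptions (not from the thread): x86-64-style 4 KiB base pages and a
# 64-byte struct page. Both vary by architecture and kernel config.
BASE_PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64


def subpage_count(hugepage_size):
    """Base pages per huge page, i.e. how many per-subpage mapcounts exist."""
    return hugepage_size // BASE_PAGE_SIZE


def vmemmap_bytes(hugepage_size):
    """Bytes of struct page metadata describing one huge page."""
    return subpage_count(hugepage_size) * STRUCT_PAGE_SIZE


if __name__ == "__main__":
    for name, size in (("2 MiB", 2 << 20), ("1 GiB", 1 << 30)):
        print(f"{name}: {subpage_count(size)} subpages, "
              f"{vmemmap_bytes(size) // 1024} KiB of struct pages")
```

Under those assumptions, a 1 GiB page has 262144 subpages described by
16 MiB of struct pages, versus 512 subpages and 32 KiB for a 2 MiB page.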