On 06/07/23 10:13, David Hildenbrand wrote:
> On 07.06.23 09:51, Yosry Ahmed wrote:
> > On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >
> > > On 07.06.23 00:40, David Rientjes wrote:
> > > > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> > > >
> > > > > The benefit of HGM in the case of memory errors is fairly obvious. As
> > > > > mentioned above, when a memory error is encountered on a hugetlb page,
> > > > > that entire hugetlb page becomes inaccessible to the application.
> > > > > Losing 1G or even 2M of data is often catastrophic for an application.
> > > > > There is often no way to recover. It just makes sense that recovering
> > > > > from the loss of 4K of data would generally be easier and more likely
> > > > > to be possible. Today, when Oracle DB encounters a hard memory error
> > > > > on a hugetlb page it will shut down. Plans are currently in place to
> > > > > repair and recover from such errors if possible. Isolating the area
> > > > > of data loss to a single 4K page significantly increases the
> > > > > likelihood of repair and recovery.
> > > > >
> > > > > Today, when a memory error is encountered on a hugetlb page, an
> > > > > application is 'notified' of the error by a SIGBUS, as well as the
> > > > > virtual address of the hugetlb page and its size. This makes sense as
> > > > > hugetlb pages are accessed by a single page table entry, so you get
> > > > > all or nothing. As mentioned by James above, this is catastrophic for
> > > > > VMs, as the hypervisor has just been told that 2M or 1G is now
> > > > > inaccessible. With HGM, we can isolate such errors to 4K.
> > > > >
> > > > > Backing VMs with hugetlb pages is a real use case today. We are
> > > > > seeing memory errors on such hugetlb pages with the result being VM
> > > > > failures. One of the advantages of backing VMs with THPs is that they
> > > > > are split in the case of memory errors. HGM would allow similar
> > > > > functionality.
> > > >
> > > > Thanks for this context, Mike, it's very useful.
> > > >
> > > > I think everybody is aligned on the desire to map memory at smaller
> > > > granularities for multiple use cases, and it's fairly clear that these
> > > > use cases are critically important to multiple stakeholders.
> > > >
> > > > I think the open question is whether this functionality is supported
> > > > in hugetlbfs (like with HGM) or whether there is a hard requirement
> > > > that we must use THP for this support.
> > > >
> > > > I don't think that hugetlbfs is feature frozen, but if there's a
> > > > strong bias toward not merging additional complexity into the
> > > > subsystem, that would be useful to know. I personally think the
> > > > critical use cases described
> > >
> > > At least I, attending that session, thought that it was clear that the
> > > majority of the people speaking up clearly expressed "no more added
> > > complexity". So I think there is a clear strong bias, at least from the
> > > people attending that session.
> > >
> > > > above justify the added complexity of HGM to hugetlb, and we wouldn't
> > > > be blocked by the long-standing (15+ years) desire to mesh hugetlb
> > > > into the core MM subsystem before we can stop the pain associated
> > > > with memory poisoning and live migration.
> > > >
> > > > Are there strong objections to extending hugetlb for this support?
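
To make the notification mechanism described above concrete, here is a
minimal sketch of the receiving side. Only the SIGBUS/BUS_MCEERR_*
reporting and the si_addr/si_addr_lsb fields are the actual documented
interface; the handler itself and its recovery policy are purely
illustrative.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * The kernel sends SIGBUS with si_code BUS_MCEERR_AR (access to poisoned
 * memory) or BUS_MCEERR_AO (asynchronous notification).  si_addr points
 * into the mapping and si_addr_lsb encodes the size of the inaccessible
 * region: today log2 of the hugetlb page size (30 for 1G, 21 for 2M);
 * with HGM it could be 12, i.e. a single 4K page.
 */
static void mce_handler(int sig, siginfo_t *si, void *ucontext)
{
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		size_t lost = (size_t)1 << si->si_addr_lsb;

		/* fprintf is not async-signal-safe; illustration only */
		fprintf(stderr, "memory error at %p, %zu bytes lost\n",
			si->si_addr, lost);
		/* a real app would attempt repair/refetch of that range */
	}
	_exit(1);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_flags = SA_SIGINFO;
	sa.sa_sigaction = mce_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);
	/* ... map hugetlb memory and run the workload ... */
	return 0;
}
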
> > >
> > > I don't want to get too involved in this discussion (busy), but I
> > > absolutely agree on the points that were raised at LSF/MM that
> > >
> > > (A) hugetlb is complicated and very special (many things not integrated
> > > with core-mm, so we need special-casing all over the place). [example:
> > > what is a pte?]
> > >
> > > (B) We added a bunch of complexity in the past that some people
> > > considered very important (and it was not feature frozen, right? ;) ).
> > > Looking back, we might just not have done some of that, or done it
> > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > because it fails with NUMA/fork, ...)
> > >
> > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > out of reach, maybe even impossible with all the complexity we added
> > > over the years (well, and keep adding).
> > >
> > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > hugetlb is probably 20 years old and hwpoison handling probably 13
> > > years old, so we managed to get quite far without that optimization.
> > >
> > > Absolutely, HGM for better postcopy live migration also makes sense; I
> > > guess nobody disagrees on that.
> > >
> > > But as discussed in that session, maybe we should just start anew and
> > > implement something that integrates nicely with the core, instead of
> > > making hugetlb more complicated and even more special.
> > >
> > > Now, we all know, nobody wants to do the heavy lifting for that; that's
> > > why we're discussing how to get in yet another complicated feature.
> >
> > If nobody wants to do the heavy lifting, and unifying hugetlb with core
> > MM is becoming impossible as you state, then does adding another
> > feature to hugetlb (one that we all agree is useful for multiple use
> > cases) really make things worse? In other words, if someone
>
> Well, if we (as a community) reject more complexity and outline an
> alternative of what would be acceptable (rewrite), people that really
> want these new features will *have to* do the heavy lifting.
>
> [and I see many people from employers that might have the capacity to do
> the heavy lifting, if really required, being involved in the discussion
> around HGM :P ]
>
> > decides tomorrow to do the heavy lifting, how much harder does this
> > become because of HGM, if any?
> >
> > I am the farthest away from being an expert here, I am just an
> > observer, but if the answer to the above question is "HGM doesn't
> > actually make it worse" or "HGM only slightly makes things harder",
> > then I naively think that it's something we should do, from a pure
> > cost-benefit analysis.
>
> Well, there is always the "maintainability" aspect, because upstream has
> to maintain whatever complexity gets merged. No matter what, we'll have
> to keep maintaining the current set of hugetlb features until we can
> eventually deprecate it (or some of it) in the far, far future.
>
> I, for my part, am happy as long as I can stay as far away as possible
> from hugetlb code. Again, Mike is the maintainer.

Thanks for the reminder :)

Maintainability is my primary concern with HGM. That is one of the
reasons I proposed that James pitch the topic at LSF/MM. Even though I am
the 'maintainer', changes introduced by HGM will impact others working
in mm.

> What I saw so far regarding HGM does not count as "slightly makes things
> harder".
>
> > Again, I don't have a lot of context here, and I understand everyone's
> > frustration with the current state of hugetlb. Just my 2 cents.
>
> The thing is, we all agree that something that hugetlb provides is
> valuable (i.e., a pool of huge/large pages that we can map large), just
> that after 20 years there might be better ways of doing it and
> integrating it better with core-mm.

I am struggling with how to support existing hugetlb users that are
running into issues like memory errors on hugetlb pages today. And, yes,
that is a source of real customer issues. They are not really happy with
the current design, where a single error will take out a 1G page, and
with it their VM or application. Moving to THP is not likely, as they
really want a pre-allocated pool of 1G pages.

I just don't have a good answer for them.
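
For anyone who has not used it, the pool model these users depend on
looks roughly like the sketch below. The boot-time pool setup and the
mmap flags are the existing interface; the sizes and the MAP_HUGE_1GB
fallback define are just for illustration.

/*
 * The pool is preallocated at boot, e.g. with:
 *     hugepagesz=1G hugepages=16
 * on the kernel command line, then mapped like this:
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* log2(1G) << MAP_HUGE_SHIFT */
#endif

#define GB	(1024UL * 1024 * 1024)

int main(void)
{
	/* back VM/application memory with 1G pages from the boot-time pool */
	void *mem = mmap(NULL, 4 * GB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			 MAP_HUGE_1GB, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");		/* e.g. the 1G pool is exhausted */
		return 1;
	}

	/*
	 * Today, one uncorrectable error anywhere in 'mem' poisons a full
	 * 1G page; further access to any of it raises the SIGBUS shown
	 * earlier.  HGM would keep all but the bad 4K mapped.
	 */
	munmap(mem, 4 * GB);
	return 0;
}

--
Mike Kravetz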