Mike Kravetz wrote:
> On 06/07/23 10:13, David Hildenbrand wrote:
[..]
> I am struggling with how to support existing hugetlb users that are running
> into issues like memory errors on hugetlb pages today. And, yes that is a
> source of real customer issues. They are not really happy with the current
> design that a single error will take out a 1G page, and their VM or
> application. Moving to THP is not likely as they really want a pre-allocated
> pool of 1G pages. I just don't have a good answer for them.

Is it the reporting interface, or the fact that the page gets offlined
too quickly?

I.e. if the 1GB page was unmapped from userspace as in the usual
memory-failure flow, but the application had an opportunity to record
what got clobbered at a smaller granularity and then ask the kernel to
repair the page, would that relieve some pain?

Here "repair" means either atomically writing a full cacheline of
zeroes, or copying the data around the poison to a new page and
returning the old page to be broken down, so that only the single 4K
page with the error stays quarantined.
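To make the "smaller granularity" bookkeeping concrete, here is a rough sketch of the arithmetic an application might do once it knows the failing address. This is illustration only, not an existing kernel or libc interface: `locate_poison()` and `struct poison_loc` are hypothetical names, the 64-byte cacheline size is an assumption, and how the application learns the address is left open.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_1G   (1ULL << 30)  /* size of the hugetlb page under discussion */
#define PAGE_4K   (1ULL << 12)  /* base page size */
#define CACHELINE 64ULL         /* assumption: 64-byte cachelines */

/* Hypothetical record of where poison sits inside a 1G page. */
struct poison_loc {
    uint64_t subpage;    /* index of the 4K page within the 1G page */
    uint64_t cacheline;  /* index of the cacheline within that 4K page */
};

/*
 * Hypothetical helper: translate a failing address into the 4K page and
 * cacheline it belongs to, relative to the start of the 1G page.
 */
static struct poison_loc locate_poison(uint64_t huge_base, uint64_t fault_addr)
{
    /* The address must fall inside the 1G page starting at huge_base. */
    assert(fault_addr >= huge_base && fault_addr - huge_base < HUGE_1G);

    uint64_t off = fault_addr - huge_base;
    struct poison_loc loc = {
        .subpage   = off / PAGE_4K,
        .cacheline = (off % PAGE_4K) / CACHELINE,
    };
    return loc;
}
```

With something like this, the application could log which one of the 262144 4K pages (and which of its 64 cachelines) was hit, and only that range would need to be rebuilt or quarantined rather than the whole 1G page.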