Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 8 Jun 2023 14:54:15 -0700

Mike Kravetz wrote:
> On 06/07/23 10:13, David Hildenbrand wrote:
[..]
> I am struggling with how to support existing hugetlb users that are running
> into issues like memory errors on hugetlb pages today.  And, yes that is a
> source of real customer issues.  They are not really happy with the current
> design that a single error will take out a 1G page, and their VM or
> application.  Moving to THP is not likely as they really want a pre-allocated
> pool of 1G pages.  I just don't have a good answer for them.

Is it the reporting interface, or the fact that the page gets offlined
too quickly? I.e. if the 1GB page was unmapped from userspace per usual
memory-failure, but the application had an opportunity to record what
got clobbered on a smaller granularity and then ask the kernel to repair
the page, would that relieve some pain? Where repair is atomically
writing a full cacheline of zeroes, or copying around the poison to a
new page and returning the old one to be broken down, so that only the
single 4K page with the error stays quarantined.
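
To make "record what got clobbered on a smaller granularity" concrete,
here is a rough userspace sketch of the bookkeeping only. It assumes the
application already knows the base address of the affected 1G mapping
and the faulting address (for example from si_addr in a SIGBUS handler
for BUS_MCEERR_AR). The struct, the helper name locate_poison(), and the
4K / 64-byte sizes are my assumptions for illustration; none of this is
an existing kernel or libc interface, and the "ask the kernel to repair"
step is not shown because no such interface exists today.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE  (1ULL << 30)  /* 1G hugetlb page */
#define BASE_PAGE_SIZE  (1ULL << 12)  /* 4K base page (assumed) */
#define CACHELINE_SIZE  64ULL         /* cacheline size (assumed) */

/* Hypothetical record of where inside the huge page the poison landed. */
struct poison_location {
	uint64_t subpage_index;   /* which 4K page within the 1G page */
	uint64_t cacheline_index; /* which cacheline within that 4K page */
};

/*
 * Translate a faulting address into (4K subpage, cacheline) coordinates
 * relative to the start of the 1G page that contains it.
 */
static struct poison_location locate_poison(uint64_t huge_base,
					    uint64_t fault_addr)
{
	struct poison_location loc;
	uint64_t offset;

	/* The base must be 1G aligned and the fault must fall inside it. */
	assert(huge_base % HUGE_PAGE_SIZE == 0);
	assert(fault_addr >= huge_base);
	assert(fault_addr - huge_base < HUGE_PAGE_SIZE);

	offset = fault_addr - huge_base;
	loc.subpage_index = offset / BASE_PAGE_SIZE;
	loc.cacheline_index = (offset % BASE_PAGE_SIZE) / CACHELINE_SIZE;
	return loc;
}
```

With that record the application knows that 1 of the 262144 4K pages in
the 1G page is bad, and which cacheline within it, so it can decide
whether the lost data is reconstructible before asking for any repair.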