On Tue, 23 Jun 2020, Luck, Tony wrote:
> > Hardware actually tells us the blast radius of the error, but we ignore
> > it and take out the entire page.  We've had a customer request to know
> > exactly how much of the page is damaged so they can avoid reconstructing
> > an entire 2MB page if only a single cacheline is damaged.
> >
> > This is only a strawman that I did in an hour or two; I'd appreciate
> > architectural-level feedback.  Should I just convert memory_failure() to
> > always take an address & granularity?  Should I create a struct to pass
> > around (page, phys, granularity) instead of reconstructing the missing
> > pieces in half a dozen functions?  Is this functionality welcome at all,
> > or is the risk of upsetting applications which expect at least a page
> > of granularity too high?
>
> What is the interface to these applications that want finer granularity?
>
> Current code does very poorly with hugetlbfs pages ... user loses the
> whole 2MB or 1GB.  That's just silly (though I've been told that it is
> hard to fix because allowing a hugetlbfs page to be broken up at an
> arbitrary time as the result of a machine check means that the kernel
> needs locking around a bunch of fast paths that currently assume that a
> huge page will stay being a huge page).

Thanks for bringing this up, Tony.  Mike Kravetz pointed me to this thread
(thanks Mike!), so let's add him in explicitly, along with Andrea, Peter,
and David from Red Hat, who we've been discussing an idea with that may
introduce exactly this needed support, albeit for different purposes :)

The timing of this thread is _uncanny_.  To improve the performance of
userfaultfd for the purposes of post-copy live migration, we need to reduce
the granularity at which pages are migrated; we're looking at this from a
1GB gigantic page perspective, but the same arguments likely apply to 2MB
hugepages as well.  1GB pages are too much of a bottleneck and, as you
bring up, 1GB is simply too much memory to poison :)

We don't have 1GB thp support, so the big idea was to introduce thp-like
DoubleMap support into hugetlbfs for the purposes of post-copy live
migration, and then I had the idea that this could be extended to memory
failure as well.  (We don't see the lack of 1GB thp here as a deficiency
for anything other than these two issues; hugetlb provides strong
guarantees.)

I don't want to hijack Matthew's thread, which is primarily about DAX, but
I did get intrigued by your concerns about hugetlbfs page poisoning.  We
can fork the thread off here to discuss only the hugetlb application of
this if it makes sense to you, or if you'd like to collaborate on it as
well.

The DoubleMap support would allow us to map the 1GB gigantic pages with the
PUD and the PMDs as well (and, further, the 2MB hugepages with the PMD and
PTEs) so that we can copy fragments into PMDs or PTEs and don't need to
migrate the entire gigantic page.  Any access triggers #PF through
hugetlb_no_page() -> handle_userfault(), which triggers another UFFDIO_COPY
and maps another fragment.

Assume a world where this DoubleMap support already exists for hugetlb
pages and all the invariants, including page migration, are fixed up (since
a PTE can now map a hugetlb page and a PMD can now map a gigantic hugetlb
page).  It *seems* like we'd be able to reduce the blast radius here too on
a hard memory failure: dissolve the gigantic page in place, SIGBUS/SIGKILL
on the bad PMD or PTE, and avoid poisoning the head of the hugetlb page.
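On your question about the interface for applications that want finer
granularity: the one userspace has today is the SIGBUS siginfo for
BUS_MCEERR_AR/BUS_MCEERR_AO, where si_addr_lsb carries the log2 of how much
memory was lost.  Right now the kernel fills that with the page (or huge
page) order; below is only a rough, untested sketch of how an application
might consume a finer-grained report if memory_failure() were ever taught
to put the real blast radius there.  The handler body is purely
illustrative:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch only: today si_addr_lsb is the page (or huge page) order; the
 * hope is a finer-grained memory_failure() could shrink it to, say, a
 * cacheline.
 */
static void sigbus_handler(int sig, siginfo_t *si, void *uc)
{
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		size_t blast = 1UL << si->si_addr_lsb;

		/* fprintf() is not async-signal-safe; fine for a sketch. */
		fprintf(stderr, "poison at %p, %zu bytes damaged\n",
			si->si_addr, blast);
		/*
		 * Reconstruct only [si_addr, si_addr + blast) instead of
		 * throwing away the surrounding 2MB/1GB mapping.  A real
		 * BUS_MCEERR_AR handler must remap or repair the range
		 * before returning, or it just re-faults.
		 */
		return;
	}
	abort();
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... map hugetlbfs/DAX memory and do work ... */
	return 0;
}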
We agree that poisoning such a large amount of memory is not ideal :)

Anyway, this was some brainstorming that I was doing with Mike and the
others based on the idea of using DoubleMap support for post-copy live
migration.  If you're interested or would like to collaborate on it, we'd
love to talk.
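To make the post-copy flow above a bit more concrete, here's a rough,
untested sketch of the userspace side of resolving one fault with
UFFDIO_COPY.  Today, for a registered hugetlbfs range, dst/len have to
cover the whole huge page; the DoubleMap support is precisely what would
let len shrink to a fragment of a 1GB gigantic page.  FRAGMENT_SIZE and
copy_fragment() are made up here for illustration:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Hypothetical fragment size: with DoubleMap the hope is that 'len'
 * could be a 2MB (or smaller) piece of a registered 1GB gigantic page
 * rather than the whole thing.
 */
#define FRAGMENT_SIZE	(2UL << 20)

/*
 * Resolve one fault by copying a single fragment into the registered
 * range.  'uffd' is a userfaultfd already registered (UFFDIO_REGISTER)
 * over the destination hugetlbfs mapping; 'src' holds the migrated data.
 */
static int copy_fragment(int uffd, void *dst_fault_addr, void *src)
{
	struct uffdio_copy copy = {
		.dst  = (unsigned long)dst_fault_addr & ~(FRAGMENT_SIZE - 1),
		.src  = (unsigned long)src,
		.len  = FRAGMENT_SIZE,
		.mode = 0,
	};

	/* Kernel maps the fragment and wakes the faulting thread. */
	return ioctl(uffd, UFFDIO_COPY, &copy);
}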