On Tue, 23 Jun 2020, Luck, Tony wrote:
> > Hardware actually tells us the blast radius of the error, but we ignore
> > it and take out the entire page.  We've had a customer request to know
> > exactly how much of the page is damaged so they can avoid reconstructing
> > an entire 2MB page if only a single cacheline is damaged.
> >
> > This is only a strawman that I did in an hour or two; I'd appreciate
> > architectural-level feedback.  Should I just convert memory_failure() to
> > always take an address & granularity?  Should I create a struct to pass
> > around (page, phys, granularity) instead of reconstructing the missing
> > pieces in half a dozen functions?  Is this functionality welcome at all,
> > or is the risk of upsetting applications which expect at least a page
> > of granularity too high?
>
> What is the interface to these applications that want finer granularity?
>
> Current code does very poorly with hugetlbfs pages ... user loses the
> whole 2MB or 1GB.  That's just silly (though I've been told that it is
> hard to fix because allowing a hugetlbfs page to be broken up at an
> arbitrary time as the result of a machine check means that the kernel
> needs locking around a bunch of fast paths that currently assume that a
> huge page will stay being a huge page).

Thanks for bringing this up, Tony.  Mike Kravetz pointed me to this thread
(thanks Mike!), so let's add him in explicitly, along with Andrea, Peter,
and David from Red Hat, who we've been discussing an idea with that may
introduce exactly this needed support, albeit for different purposes :)

The timing of this thread is _uncanny_.  To improve the performance of
userfaultfd for the purposes of post-copy live migration, we need to reduce
the granularity at which pages are migrated; we're looking at this from a
1GB gigantic page perspective, but the same arguments likely apply to 2MB
hugepages as well.  1GB pages are too much of a bottleneck and, as you
bring up, 1GB is simply too much memory to poison :)

We don't have 1GB thp support, so the big idea was to introduce thp-like
DoubleMap support into hugetlbfs for the purposes of post-copy live
migration, and then I had the idea that this could be extended to memory
failure as well.  (We don't see the lack of 1GB thp here as a deficiency
for anything other than these two issues; hugetlb provides strong
guarantees.)

I don't want to hijack Matthew's thread, which is primarily about DAX, but
I did get intrigued by your concerns about hugetlbfs page poisoning.  We
can fork the thread off here to discuss only the hugetlb application of
this if it makes sense to you, or if you'd like to collaborate on it as
well.

The DoubleMap support would allow us to map the 1GB gigantic pages with the
PUD and the PMDs as well (and, further, the 2MB hugepages with the PMD and
PTEs) so that we can copy fragments into PMDs or PTEs and don't need to
migrate the entire gigantic page.  Any access triggers #PF through
hugetlb_no_page() -> handle_userfault(), which triggers another UFFDIO_COPY
and maps another fragment.

Assume a world where this DoubleMap support already exists for hugetlb
pages and all the invariants, including page migration, are fixed up (since
a PTE can now map a hugetlb page and a PMD can now map a gigantic hugetlb
page).  It *seems* like we'd be able to reduce the blast radius here too on
a hard memory failure: dissolve the gigantic page in place, SIGBUS/SIGKILL
on the bad PMD or PTE, and avoid poisoning the head of the hugetlb page.
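On your question about the interface for applications that want finer
granularity: the one userspace has today is the SIGBUS siginfo for
BUS_MCEERR_AR/BUS_MCEERR_AO, where si_addr_lsb carries the log2 of how much
memory was lost.  Right now the kernel fills that with the page (or huge
page) order; below is only a rough, untested sketch of how an application
might consume a finer-grained report if memory_failure() were ever taught
to put the real blast radius there.  The handler body is purely
illustrative:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch only: today si_addr_lsb is the page (or huge page) order; the
 * hope is a finer-grained memory_failure() could shrink it to, say, a
 * cacheline.
 */
static void sigbus_handler(int sig, siginfo_t *si, void *uc)
{
	if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
		size_t blast = 1UL << si->si_addr_lsb;

		/* fprintf() is not async-signal-safe; fine for a sketch. */
		fprintf(stderr, "poison at %p, %zu bytes damaged\n",
			si->si_addr, blast);
		/*
		 * Reconstruct only [si_addr, si_addr + blast) instead of
		 * throwing away the surrounding 2MB/1GB mapping.  A real
		 * BUS_MCEERR_AR handler must remap or repair the range
		 * before returning, or it just re-faults.
		 */
		return;
	}
	abort();
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... map hugetlbfs/DAX memory and do work ... */
	return 0;
}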
We agree that poisoning such a large amount of memory is not ideal :)

Anyway, this was some brainstorming that I was doing with Mike and the
others based on the idea of using DoubleMap support for post-copy live
migration.  If you're interested or would like to collaborate on it, we'd
love to talk.
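To make the post-copy flow above a bit more concrete, here's a rough,
untested sketch of the userspace side of resolving one fault with
UFFDIO_COPY.  Today, for a registered hugetlbfs range, dst/len have to
cover the whole huge page; the DoubleMap support is precisely what would
let len shrink to a fragment of a 1GB gigantic page.  FRAGMENT_SIZE and
copy_fragment() are made up here for illustration:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

/*
 * Hypothetical fragment size: with DoubleMap the hope is that 'len'
 * could be a 2MB (or smaller) piece of a registered 1GB gigantic page
 * rather than the whole thing.
 */
#define FRAGMENT_SIZE	(2UL << 20)

/*
 * Resolve one fault by copying a single fragment into the registered
 * range.  'uffd' is a userfaultfd already registered (UFFDIO_REGISTER)
 * over the destination hugetlbfs mapping; 'src' holds the migrated data.
 */
static int copy_fragment(int uffd, void *dst_fault_addr, void *src)
{
	struct uffdio_copy copy = {
		.dst  = (unsigned long)dst_fault_addr & ~(FRAGMENT_SIZE - 1),
		.src  = (unsigned long)src,
		.len  = FRAGMENT_SIZE,
		.mode = 0,
	};

	/* Kernel maps the fragment and wakes the faulting thread. */
	return ioctl(uffd, UFFDIO_COPY, &copy);
}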