Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

Jiaqi Yan <jiaqiyan@xxxxxxxxxx> · Tue, 15 Oct 2024 16:45:49 -0700

On Fri, Oct 11, 2024 at 11:28 AM <jane.chu@xxxxxxxxxx> wrote:
>
> On 10/10/2024 4:21 PM, Jiaqi Yan wrote:
>
> > On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@xxxxxxxxxx> wrote:
> >> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> >>> soned page (sub- or huge-) will eventually be isolated, because,
> >>> The code here is "global policy". The "per-VMA policy", proposed in
> >>> 0/2 but code not sent, should be able to support isolation + offline
> >>> at some point (all VMAs are gone and page becomes free).
> >> "per-VMA policy" sounds interesting.
> >>>> Another thing I'm curious about is whether you have tested with real
> >>>> hardware UE - the one that triggers MCE.  When a real UE is consumed by
> >>> Yes, with our workload. Can you share more about what is the "training
> >>> process"? Is it something to train memory or screen memory errors?
> >> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> >> it as an example.
> > Got you. In that case, if the ML workload (running in a VM) wants to
> > do what you described, wouldn't losing a 1G hugetlb page to kernel
> > offline make it even harder for the VM/workload to run its recovery
> > logic?
>
> Indeed.
>
> As user applications get more sophisticated at recovering from
> poison, what about having the kernel do the heavy lifting?

I think there are two things.

First, if userspace claims it has sufficient (or sophisticated) recovery
ability, and assuming we trust it, can it take full control of what
happens to the hardware-poisoned memory page it **owns**?
My answer to this question is yes. The reason is that I believe the
kernel has limited ability to do memory failure recovery (MFR)
optimally for all userspace. The current hard offline support in the
kernel has also made userspace recovery hard, so userspace deserves a
say in MFR.

Second, what is the granularity of the control? This patch makes the
control apply to every process. What about making it controllable only
by the userspace process that owns the memory page? The kernel can
still do the heavy lifting (hard offline, set HWPoison) **after** the
owning process relinquishes control or exits.

Another way to "disable hard offline but still set HWPoison" that I can
think of is to make the HWPoison flag apply at page_size level instead
of always setting it on the compound head. At least from hugetlb's
perspective, is that a good idea?
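
To put rough numbers on the granularity point, here is a small
back-of-the-envelope sketch. It is plain arithmetic, not kernel code:
the function names are made up for illustration, and it assumes 4 KiB
base pages, the 1G hugetlb page from this thread, and that marking the
compound head makes the whole hugetlb page unusable to its owner.

```python
BASE_PAGE_SIZE = 4 * 1024       # assumption: 4 KiB base pages
HUGETLB_PAGE_SIZE = 1024 ** 3   # the 1G hugetlb page discussed in this thread


def poisoned_subpage(fault_addr, hugepage_start):
    """Index of the base page inside the hugetlb page that holds fault_addr."""
    offset = fault_addr - hugepage_start
    if not 0 <= offset < HUGETLB_PAGE_SIZE:
        raise ValueError("address is outside the hugetlb page")
    return offset // BASE_PAGE_SIZE


def bytes_lost(poisoned_subpages, per_subpage):
    """Bytes the owner can no longer use under each marking granularity.

    per_subpage=False models a flag on the compound head only (the whole
    hugetlb page goes away); per_subpage=True models a flag on each
    poisoned base page (only those base pages go away).
    """
    if not poisoned_subpages:
        return 0
    if per_subpage:
        return len(set(poisoned_subpages)) * BASE_PAGE_SIZE
    return HUGETLB_PAGE_SIZE
```

For a single poisoned base page, the two models differ by a factor of
HUGETLB_PAGE_SIZE // BASE_PAGE_SIZE = 262144 in how much memory the
owning process loses.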

>
> Something like by way of userfaultfd,  kernel provides a new/clean
> hugetlb page, copied over good data from the clean subpages and then
> present the clean hugetlb page to user process with indication that
> subpage x is a substitute of the poisoned old subpage x, hence its data
> might need a refill?  I am not sure how exactly to pull this through as
> the event is not a page-fault, but just wondering whether something like
> this is possible.
>
> thanks,
>
> -jane
>
> >
> >> -jane
> >>