Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

jane.chu@xxxxxxxxxx · Fri, 11 Oct 2024 11:28:04 -0700

On 10/10/2024 4:21 PM, Jiaqi Yan wrote:

On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@xxxxxxxxxx> wrote:
On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
soned page (sub- or huge-) will eventually be isolated, because,
The code here is "global policy". The "per-VMA policy", proposed in
0/2 but code not sent, should be able to support isolation + offline
at some point (all VMAs are gone and page becomes free).
"per-VMA policy" sounds interesting.
Another thing I'm curious at is whether you have tested with real
hardware UE - the one that triggers MCE.  When a real UE is consumed by
Yes, with our workload. Can you share more about what is the "training
process"? Is it something to train memory or screen memory errors?
The cover letter mentioned "Machine Learning (ML) workloads", so I used
it as an example.
Got you. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing 1G hugetlb page due to kernel
offline make the VM/workload even harder to execute recover logic?

Indeed.

As user applications get more sophisticated at recovering from
poison, what about having the kernel do the heavy lifting?

Something like this: by way of userfaultfd, the kernel provides a
new, clean hugetlb page, copies over the good data from the clean
subpages, and then presents the clean hugetlb page to the user process
with an indication that subpage x is a substitute for the poisoned old
subpage x, hence its data might need a refill?  I am not sure how
exactly to pull this through, as the event is not a page fault, but I
am just wondering whether something like this is possible.
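
To make the idea concrete, here is a rough userspace model of the
copy-and-substitute step, not kernel code and not a proposed interface.
Everything in it is an assumption for illustration: the names
replace_poisoned_hugepage() and struct refill_report are hypothetical,
the page is shrunk to 8 subpages of 4K, and the poisoned subpage is
zero-filled in the replacement rather than read from the old page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes for illustration; a real 1G hugetlb page has far more subpages. */
#define SUBPAGE_SIZE 4096u
#define NR_SUBPAGES  8u

/* Hypothetical report handed to the user process along with the clean page. */
struct refill_report {
	size_t nr_refill;               /* how many subpages were substituted */
	size_t refill_idx[NR_SUBPAGES]; /* indices whose data may need a refill */
};

/*
 * Build a clean replacement for a partially poisoned huge page.
 * Clean subpages are copied over; poisoned subpages are never read,
 * are zero-filled in the new page, and are recorded in the report.
 */
static void replace_poisoned_hugepage(const unsigned char *old_page,
				      unsigned char *new_page,
				      const bool *poisoned,
				      struct refill_report *report)
{
	assert(old_page && new_page && poisoned && report);

	report->nr_refill = 0;
	for (size_t i = 0; i < NR_SUBPAGES; i++) {
		unsigned char *dst = new_page + i * SUBPAGE_SIZE;

		if (poisoned[i]) {
			/* Substitute subpage: touching the source would consume the error. */
			memset(dst, 0, SUBPAGE_SIZE);
			report->refill_idx[report->nr_refill++] = i;
		} else {
			memcpy(dst, old_page + i * SUBPAGE_SIZE, SUBPAGE_SIZE);
		}
	}
}
```

The user process would then walk refill_idx[] and regenerate or reload
only those subpages, instead of losing the whole hugetlb page.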

thanks,

-jane
