Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

On 10/10/2024 4:21 PM, Jiaqi Yan wrote:

On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@xxxxxxxxxx> wrote:
On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
soned page (sub- or huge-) will eventually be isolated, because,
The code here is "global policy". The "per-VMA policy", proposed in
0/2 but code not sent, should be able to support isolation + offline
at some point (all VMAs are gone and page becomes free).
"per-VMA policy" sounds interesting.
Another thing I'm curious about is whether you have tested with real
hardware UE - the one that triggers MCE.  When a real UE is consumed by
Yes, with our workload. Can you share more about what is the "training
process"? Is it something to train memory or screen memory errors?
The cover letter mentioned "Machine Learning (ML) workloads", so I used
it as an example.
Got you. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing 1G hugetlb page due to kernel
offline make the VM/workload even harder to execute recover logic?

Indeed.

As user applications become more sophisticated at recovering from poison, what about having the kernel do the heavy lifting?

Something like this, by way of userfaultfd: the kernel provides a new, clean hugetlb page, copies over the good data from the clean subpages, and then presents the clean hugetlb page to the user process with an indication that subpage x is a substitute for the poisoned old subpage x, so its data might need a refill. I am not sure exactly how to pull this off, as the event is not a page fault, but I am wondering whether something like this is possible.
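
To make the idea a bit more concrete, here is a rough userspace illustration in C. It is not kernel code and not a proposed interface: the helper name replace_poisoned_hugepage(), the needs_refill[] array, and the 4 KiB subpage size are all assumptions made up for this sketch. It only shows the "copy the clean subpages, flag the substituted one" step; how the indication would actually reach the process (userfaultfd or otherwise) is left open, and a real implementation would also have to take care never to read the poisoned subpage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed base page size for illustration only. */
#define SUBPAGE_SIZE 4096UL

/*
 * Hypothetical helper: build a replacement for a huge page that has one
 * poisoned subpage. Clean subpages are copied from old_page to new_page;
 * the poisoned subpage is never read, its substitute is zero-filled, and
 * needs_refill[] marks it so the application knows to regenerate the data.
 * Returns the number of subpages whose contents were preserved.
 */
static size_t replace_poisoned_hugepage(const unsigned char *old_page,
                                        unsigned char *new_page,
                                        size_t nr_subpages,
                                        size_t poisoned_idx,
                                        bool *needs_refill)
{
    size_t copied = 0;

    assert(poisoned_idx < nr_subpages);

    for (size_t i = 0; i < nr_subpages; i++) {
        unsigned char *dst = new_page + i * SUBPAGE_SIZE;

        if (i == poisoned_idx) {
            /* Substitute subpage: no valid data to carry over. */
            memset(dst, 0, SUBPAGE_SIZE);
            needs_refill[i] = true;
            continue;
        }

        memcpy(dst, old_page + i * SUBPAGE_SIZE, SUBPAGE_SIZE);
        needs_refill[i] = false;
        copied++;
    }

    return copied;
}
```

With something along these lines, the application would only need to refill the one subpage flagged in needs_refill[] instead of losing the whole hugetlb page.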

thanks,

-jane






