Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@xxxxxxxxxx> wrote:
>
> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> > soned page (sub- or huge-) will eventually be isolated, because,
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> "per-VMA policy" sounds interesting.
> >> Another thing I'm curious at is whether you have tested with real
> >> hardware UE - the one that triggers MCE.  When a real UE is consumed by
> > Yes, with our workload. Can you share more about what is the "training
> > process"? Is it something to train memory or screen memory errors?
>
> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> it as an example.

Got it. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing a 1G hugetlb page to kernel
offlining make it even harder for the VM/workload to execute its
recovery logic?

>
> -jane
>




