On Mon, Oct 7, 2024 at 10:24 AM <jane.chu@xxxxxxxxxx> wrote:
>
> On 10/3/2024 4:51 PM, Jiaqi Yan wrote:
> > soned page (sub- or huge-) will eventually be isolated, because,
> > The code here is "global policy". The "per-VMA policy", proposed in
> > 0/2 but code not sent, should be able to support isolation + offline
> > at some point (all VMAs are gone and page becomes free).
> "per-VMA policy" sounds interesting.
> >> Another thing I'm curious at is whether you have tested with real
> >> hardware UE - the one that triggers MCE. When a real UE is consumed by
> > Yes, with our workload. Can you share more about what is the "training
> > process"? Is it something to train memory or screen memory errors?
>
> The cover letter mentioned "Machine Learning (ML) workloads", so I used
> it as an example.

Got you. In that case, if the ML workload (running in a VM) wants to
do what you described, wouldn't losing a 1G hugetlb page due to kernel
offline make it even harder for the VM/workload to execute its
recovery logic?

>
> -jane
>