Hi Jane,

On Wed, Oct 2, 2024 at 4:50 PM <jane.chu@xxxxxxxxxx> wrote:
>
> Hi,
>
> On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
> >
> > +	/*
> > +	 * On ARM64, if APEI failed to claim SEA, (e.g. GHES driver doesn't
> > +	 * register to SEA notifications from firmware), memory_failure will
> > +	 * never be synchronous to the error consumption thread. Notifying
> > +	 * it via SIGBUS synchronously has to be done by either core kernel in
> > +	 * do_mem_abort, or KVM in kvm_handle_guest_abort.
> > +	 */
> > +	if (!sysctl_enable_hard_offline) {
> > +		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
> > +		kill_procs_now(p, pfn, flags, page_folio(p));
> > +		res = -EOPNOTSUPP;
> > +		goto unlock_mutex;
> > +	}
> > +
>
> I am curious why the SIGBUS is sent without setting PG_hwpoison in the
> page. In 0/2 there seems to be an indication that threads coordinate
> with each other such that clean subpages in a poisoned hugetlb page
> continue to be accessible, and at some point, (or perhaps I misread),
> the poisoned page (sub- or huge-) will eventually be isolated, because,

The code here is the "global policy". The "per-VMA policy", proposed in
0/2 but whose code is not yet sent, should be able to support isolation +
offline at some point (once all VMAs are gone and the page becomes free).

> it's unthinkable to let a poisoned page lie around while the kernel
> treats it like a clean page? But I'm not sure how you plan to handle it
> without PG_hwpoison while hard_offline is disabled globally.

It becomes the responsibility of a control plane running in userspace.
For example, the control plane could immediately prevent starting any new
workload/VM, but choose to wait until memory errors exceed a certain
threshold, or hold on to the host until all workloads/VMs are migrated,
and only then repair the machine.

Not setting PG_hwpoison is indeed a big difference and a risk, so it
needs to be carefully handled by userspace.
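For illustration, if the knob lands at the path the quoted hunk prints,
the control plane would toggle the global policy at runtime roughly like
this (sysctl proposed by this series, not upstream; a sketch, not a
definitive interface):

```shell
# Keep poisoned pages mapped: SIGBUS the consumers, but skip kernel
# hard offline; userspace policy decides when to drain and repair.
echo 0 > /proc/sys/vm/enable_hard_offline

# Restore the default behavior: isolate and offline the page on poison.
echo 1 > /proc/sys/vm/enable_hard_offline
```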
> Another thing I'm curious about is whether you have tested with real
> hardware UE - the one that triggers MCE.

Yes, with our workload. Can you share more about what the "training
process" is? Is it something to train memory, or to screen for memory
errors?

> When a real UE is consumed by
> the training process, the user process must longjmp out in order to
> avoid getting stuck at the same instruction that fetched a UE memory.
> Given a longjmp is needed (unless I am missing something), the training
> process is already in a situation where it has to figure out things like
> rewind, where-to-restart-from, does it even keep state? etc. On the
> whole, whether the burden of asking the user application to deal with
> what's lacking in the kernel, namely the lack of splitting up a hugetlb
> page, is worthwhile, is something that needs to be weighed.

For sure, and that's why I spent a lot of words in the cover letter on
the two use cases where asking the user application to deal with what's
lacking in the kernel is worthwhile.

> Thanks,
> -jane