Re: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

Hi,

On 9/23/2024 9:39 PM, Jiaqi Yan wrote:
+	/*
+	 * On ARM64, if APEI failed to claim SEA (e.g. GHES driver doesn't
+	 * register to SEA notifications from firmware), memory_failure will
+	 * never be synchronous to the error consumption thread. Notifying
+	 * it via SIGBUS synchronously has to be done by either core kernel in
+	 * do_mem_abort, or KVM in kvm_handle_guest_abort.
+	 */
+	if (!sysctl_enable_hard_offline) {
+		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+		kill_procs_now(p, pfn, flags, page_folio(p));
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+

I am curious why SIGBUS is sent without setting PG_hwpoison on the page. The cover letter (0/2) seems to indicate that threads coordinate with each other so that the clean subpages of a poisoned hugetlb page remain accessible, and that at some point (or perhaps I misread) the poisoned page (sub- or huge-) will eventually be isolated, because it is unthinkable to leave a poisoned page lying around while the kernel treats it like a clean page. But I'm not sure how you plan to handle that without PG_hwpoison while hard offline is disabled globally.

Another thing I'm curious about is whether you have tested with a real hardware UE, the kind that triggers an MCE. When a real UE is consumed by the training process, the user process must longjmp out to avoid getting stuck on the same instruction that fetched the UE memory. Given that a longjmp is needed (unless I am missing something), the training process is already in a situation where it has to figure out things like rewinding, where to restart from, and whether it even keeps state. On the whole, whether it is worthwhile to burden the user application with making up for what the kernel lacks, namely the ability to split a hugetlb page, is something that needs to be weighed.
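
To make the longjmp point concrete, here is a rough userspace sketch of what I have in mind. It is only an illustration under my own assumptions, not something taken from the patch: every name in it (run_guarded, install_sigbus_handler, fake_poison_access, clean_access) is hypothetical, and raise(SIGBUS) merely stands in for a real UE consumption, which I have not reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper names; nothing here is part of the patch. */
static sigjmp_buf recover_point;
static volatile sig_atomic_t guard_armed;

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	/* For a real memory error, info->si_addr carries the faulting address. */
	(void)info;

	if (!guard_armed) {
		/*
		 * Nobody can recover: restore the default action. On a real
		 * fault the instruction re-executes and the process is killed.
		 */
		signal(SIGBUS, SIG_DFL);
		return;
	}
	/* Jump out instead of returning to the faulting instruction. */
	siglongjmp(recover_point, 1);
}

int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

/* Run fn(arg); return 0 if it completed, -1 if SIGBUS interrupted it. */
int run_guarded(void (*fn)(void *), void *arg)
{
	assert(fn != NULL);

	/* savemask=1 so SIGBUS is unblocked again after the jump. */
	if (sigsetjmp(recover_point, 1) != 0) {
		guard_armed = 0;
		return -1;	/* caller decides where to restart from */
	}
	guard_armed = 1;
	fn(arg);
	guard_armed = 0;
	return 0;
}

/* Stand-in for consuming a UE: the signal is only simulated. */
void fake_poison_access(void *unused)
{
	(void)unused;
	raise(SIGBUS);
}

/* Ordinary access to healthy memory. */
void clean_access(void *p)
{
	*(volatile char *)p = 1;
}
```

Calling install_sigbus_handler() once and then run_guarded(fake_poison_access, NULL) should return -1 rather than re-running the faulting access, while run_guarded(clean_access, &byte) returns 0. Note that all the sketch recovers is control flow; what state is still valid after the jump is exactly the part the application would have to work out by itself.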

Thanks,

-jane
