On Wed, Aug 28, 2024 at 4:50 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> > On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > >
> > > > Instead of removing the whole pud, can driver or memory_failure do
> > > > something similar to non-struct-page-version of split_huge_page? So
> > > > driver doesn't need to re-fault good pages back?
> > >
> > > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > > just for memory failure reporting.
> >
> > If I follow this, which of the following sounds better? 1. remove pud
> > and rely on the driver to re-fault PFNs that it knows are not poisoned
> > (what Peter suggested), or 2. keep the pud and allow access to both
> > good and bad PFNs.
>
> In practice I think people will need 2, as breaking up a 1G mapping
> just because a few bits are bad will destroy the VM performance.
>

Totally agreed.

> For this the expectation would be for the VM to co-operate and not
> keep causing memory failures, or perhaps for the platform to spare in
> good memory somehow.

Yes, whether a VM gets into a memory-error-consumption loop maliciously
or accidentally, a reasonable VMM should have the means to detect and
break it.

>
> > Or provide some knob (configured by ?) so that kernel + driver can
> > switch between the two?
>
> This is also sounding reasonable, especially if we need some
> alternative protocol to signal userspace about the failed memory
> besides fault and SIGBUS.

To clarify, what I have in mind is a knob, say one named
"sysctl_enable_hard_offline", configured by userspace. Applied to
Ankit's memory_failure_pfn patch[*]:

static int memory_failure_pfn(unsigned long pfn, int flags)
{
	struct interval_tree_node *node;
	int res = MF_FAILED;
	LIST_HEAD(tokill);

	mutex_lock(&pfn_space_lock);
	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
	     node = interval_tree_iter_next(node, pfn, pfn)) {
		struct pfn_address_space *pfn_space =
			container_of(node, struct pfn_address_space, node);

		if (pfn_space->ops)
			pfn_space->ops->failure(pfn_space, pfn);

		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);

		/* New: only poke a hole in the mapping when the knob allows it. */
		if (sysctl_enable_hard_offline)
			unmap_mapping_range(pfn_space->mapping,
					    pfn << PAGE_SHIFT, PAGE_SIZE, 0);

		res = MF_RECOVERED;
	}
	mutex_unlock(&pfn_space_lock);

	if (res == MF_FAILED)
		return action_result(pfn, MF_MSG_PFN_MAP, res);

	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
	kill_procs(&tokill, true, false, pfn, flags);

	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
}

I think we still want to attempt to SIGBUS userspace, regardless of
whether we do the unmap_mapping_range.

[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@xxxxxxxxxx/#t

>
> Jason
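
P.S. To make "configured by userspace" a bit more concrete, here is a
rough, untested sketch of how such a knob could be registered through
the generic sysctl interface. This is only an illustration, not part of
Ankit's patch: the knob name, default value, /proc/sys/vm path, and the
assumption that it lives next to memory_failure_pfn in
mm/memory-failure.c (so linux/sysctl.h is already available) are all
mine.

/*
 * Hypothetical knob. Default on, so the unconditional hard offline
 * (unmap_mapping_range) in the patch above stays the default behavior.
 */
static int sysctl_enable_hard_offline = 1;

static struct ctl_table memory_failure_sysctls[] = {
	{
		.procname	= "enable_hard_offline",
		.data		= &sysctl_enable_hard_offline,
		.maxlen		= sizeof(sysctl_enable_hard_offline),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
};

static int __init memory_failure_sysctl_init(void)
{
	/* Would show up as /proc/sys/vm/enable_hard_offline. */
	register_sysctl("vm", memory_failure_sysctls);
	return 0;
}
late_initcall(memory_failure_sysctl_init);

A VMM deployment that prefers to keep the 1G mapping intact could then
set vm.enable_hard_offline=0 and rely on the SIGBUS path alone.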