On Tue, Nov 29, 2022 at 09:31:15PM -0800, David Rientjes wrote:
> On Thu, 3 Nov 2022, Jiaqi Yan wrote:
>
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> >
>
> Lots of great discussion in this thread, thanks Jiaqi for a very detailed
> overview of what is trying to be addressed and the multiple options that
> we can consider.
>
> I think this thread has been a very useful starting point for us to
> discuss what should comprise the first patchset. I haven't seen any
> objections to enlightening the kernel for this support, but any additional
> feedback would indeed be useful.
>
> Let me suggest a possible way forward: if we can agree on a kernel-driven
> approach and its design allows for it to be extended for future use cases,
> then it should be possible to introduce something generally useful that
> can then be built upon later if needed.
>
> I can think of a couple of future use cases that may arise that will
> impact the minimal design that you intend to introduce: (1) the ability to
> configure a hardware patrol scrubber depending on the platform, if
> possible, as a substitute for driving the scanning by a kthread, and (2)
> the ability to scan different types of memory rather than all system
> memory.
>
> Imagining the simplest possible design, I assume we could introduce a
> /sys/devices/system/node/nodeN/mcescan/* for each NUMA node on the system.
> As a foundation, this can include only a "stat" file which provides the
> interface to the memory poison subsystem that describes detected errors
> and their resolution (this would be a good starting point).
>
> Building on that, and using your reference to khugepaged, we can add
> pages_to_scan and scan_sleep_millisecs files. This will allow us to
> control scanning on demotion nodes differently. We'd want the kthread to
> be NUMA aware for the memory it is scanning, so this would simply control
> when each thread wakes up and how much memory it scans before going to
> sleep. Defaults would be disabled, so no kthreads are forked.
>
> If this needs to be extended later for a hardware patrol scrubber, we'd
> make this a request to cpu vendors to make configurable on a per socket
> basis and used only with an ACPI capability that would put it under the
> control of the kernel in place of the kthread (there would be a single
> source of truth for the scan configuration). If this is not possible,
> we'd decouple the software and hardware approach and configure the HPS
> through the ACPI subsystem independently.
>
> Subsequently, if there is a need to only scan certain types of memory per
> NUMA node, we could introduce a "type" file later under the mcescan
> directory. Idea would be to specify a bitmask to include certain memory
> types into the scan. Bits for things such as buddy pages, pcp pages,
> hugetlb pages, etc.
>
> [ And if userspace, perhaps non-root, wanted to trigger a scan of its own
>   virtual memory, for example, another future extension could allow you
>   to explicitly trigger a scan of the calling process, but this would be
>   done in process context, not by the kthreads.
> ]
>
> If this is deemed acceptable, the minimal viable patchset would:
>
>  - introduce the per-node mcescan directories
>
>  - introduce a "stat" file that would describe the state of memory errors
>    on each NUMA node and their disposition
>
>  - introduce a per-node kthread driven by pages_to_scan and
>    scan_sleep_millisecs to do software controlled memory scanning
>
> All future possible use cases could be extended using this later if the
> demand arises.
>
> Thoughts? It would be very useful to agree on a path forward since I
> think this would be generally useful for the kernel.

Thank you for the ideas; the above looks simple enough to me to start with.

I think one point not mentioned yet is how the in-kernel scanner finds a
broken page before the page is marked with PG_hwpoison. Some mechanism
similar to mcsafe-memcpy could be used, but maybe a memcpy is not necessary
because we just want to check the health of pages. So would a core routine
like mcsafe-read be introduced in the first patchset (or do we already have
one)?

Thanks,
Naoya Horiguchi
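
For illustration only, the per-node pacing discussed above (read up to
pages_to_scan pages per wakeup, then sleep for scan_sleep_millisecs) could
be sketched in userspace C roughly as follows. This is not kernel code and
not a proposed implementation: every struct and function name here is
hypothetical, and the read_ok callback is only a stand-in for whatever
machine-check-safe read primitive the kernel side would end up using.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node tunables, named after the proposed sysfs files. */
struct mcescan_params {
	unsigned long pages_to_scan;       /* pages read per wakeup; 0 = disabled */
	unsigned int scan_sleep_millisecs; /* sleep between wakeups (unused here) */
};

/* Hypothetical per-node state: a scan cursor and a counter for "stat". */
struct mcescan_node {
	unsigned long nr_pages;     /* pages on this node */
	unsigned long cursor;       /* next page index to read */
	unsigned long errors_found; /* pages whose read reported an error */
};

/* Stand-in for an MC-safe read of one page; returns false if it is bad. */
typedef bool (*page_read_fn)(unsigned long pfn);

/*
 * One wakeup of the (simulated) per-node scanner: read up to
 * pages_to_scan pages starting at the cursor, wrapping at the end of
 * the node. Returns the number of pages read; the caller would then
 * sleep for scan_sleep_millisecs before calling again.
 */
unsigned long mcescan_one_pass(struct mcescan_node *node,
			       const struct mcescan_params *p,
			       page_read_fn read_ok)
{
	unsigned long scanned = 0;

	assert(node != NULL && p != NULL && read_ok != NULL);

	/* Disabled by default: a zero budget means nothing is scanned. */
	if (p->pages_to_scan == 0 || node->nr_pages == 0)
		return 0;

	while (scanned < p->pages_to_scan) {
		if (!read_ok(node->cursor))
			node->errors_found++;
		node->cursor = (node->cursor + 1) % node->nr_pages;
		scanned++;
	}
	return scanned;
}

/* Demo read routine: pretends that page 5 is the only broken page. */
bool demo_read_ok(unsigned long pfn)
{
	return pfn != 5;
}
```

The read routine is passed in as a callback so that the pacing logic stays
independent of how a broken page is actually detected, which is the open
question raised above.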