Re: [RFC] Kernel Support of Memory Error Detection.

HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@xxxxxxx> · Tue, 13 Dec 2022 09:27:47 +0000

On Tue, Nov 29, 2022 at 09:31:15PM -0800, David Rientjes wrote:
> On Thu, 3 Nov 2022, Jiaqi Yan wrote:
> 
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> > 
> 
> Lots of great discussion in this thread, thanks Jiaqi for a very detailed 
> overview of what is trying to be addressed and the multiple options that 
> we can consider.
> 
> I think this thread has been a very useful starting point for us to 
> discuss what should comprise the first patchset.  I haven't seen any 
> objections to enlightening the kernel for this support, but any additional 
> feedback would indeed be useful.
> 
> Let me suggest a possible way forward: if we can agree on a kernel-driven 
> approach and its design allows for it to be extended for future use cases, 
> then it should be possible to introduce something generally useful that 
> can then be built upon later if needed.
> 
> I can think of a couple of future use cases that may arise that will 
> impact the minimal design that you intend to introduce: (1) the ability to 
> configure a hardware patrol scrubber depending on the platform, if 
> possible, as a substitute for driving the scanning by a kthread, and (2) 
> the ability to scan different types of memory rather than all system 
> memory.
> 
> Imagining the simplest possible design, I assume we could introduce a
> /sys/devices/system/node/nodeN/mcescan/* for each NUMA node on the system.  
> As a foundation, this can include only a "stat" file which provides the 
> interface to the memory poison subsystem that describes detected errors 
> and their resolution (this would be a good starting point).
> 
> Building on that, and using your reference to khugepaged, we can add 
> pages_to_scan and scan_sleep_millisecs files.  This will allow us to 
> control scanning on demotion nodes differently.  We'd want the kthread to 
> be NUMA aware for the memory it is scanning, so this would simply control 
> when each thread wakes up and how much memory it scans before going to 
> sleep.  Defaults would be disabled, so no kthreads are forked.
> 
> If this needs to be extended later for a hardware patrol scrubber, we'd 
> make this a request to cpu vendors to make configurable on a per socket 
> basis and used only with an ACPI capability that would put it under the 
> control of the kernel in place of the kthread (there would be a single 
> source of truth for the scan configuration).  If this is not possible, 
> we'd decouple the software and hardware approach and configure the HPS 
> through the ACPI subsystem independently.
> 
> Subsequently, if there is a need to only scan certain types of memory per 
> NUMA node, we could introduce a "type" file later under the mcescan 
> directory.  Idea would be to specify a bitmask to include certain memory 
> types into the scan.  Bits for things such as buddy pages, pcp pages, 
> hugetlb pages, etc.
> 
>  [ And if userspace, perhaps non-root, wanted to trigger a scan of its own 
>    virtual memory, for example, another future extension could allow you 
>    to explicitly trigger a scan of the calling process, but this would be 
>    done in process context, not by the kthreads. ]
> 
> If this is deemed acceptable, the minimal viable patchset would:
> 
>  - introduce the per-node mcescan directories
> 
>  - introduce a "stat" file that would describe the state of memory errors
>    on each NUMA node and their disposition
> 
>  - introduce a per-node kthread driven by pages_to_scan and
>    scan_sleep_millisecs to do software controlled memory scanning
> 
> All future possible use cases could be extended using this later if the 
> demand arises.
> 
> Thoughts?  It would be very useful to agree on a path forward since I 
> think this would be generally useful for the kernel.

Thank you for the ideas, the above looks simple enough to me to start with.
I think one point not mentioned yet is how the in-kernel scanner finds
a broken page before the page is marked with PG_hwpoison.  Some mechanism
similar to mcsafe-memcpy could be used, but maybe a memcpy is not necessary
because we just want to check the health of pages.  So would a core routine
like mcsafe-read be introduced in the first patchset (or do we already
have one)?
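
To make the question concrete, here is a rough userspace sketch (not kernel
code) of how a read-only probe could plug into one wakeup of a per-node
scanner.  Everything in it is hypothetical: probe_page(), scan_batch(),
struct mcescan_cfg and struct mcescan_stat are illustrative names only, and
the "bad" array merely simulates what a real mcsafe-read would learn from a
machine check caught through exception fixup.  Only pages_to_scan and
scan_sleep_millisecs come from the proposal above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096u
#define SKETCH_CACHE_LINE 64u

/* Hypothetical outcome of probing one page. */
enum probe_result { PROBE_OK = 0, PROBE_POISON = 1 };

/* Hypothetical per-node knobs mirroring the proposed sysfs files. */
struct mcescan_cfg {
	unsigned int pages_to_scan;        /* pages per wakeup, 0 = disabled */
	unsigned int scan_sleep_millisecs; /* the caller sleeps this long between batches */
};

/* Hypothetical counters that a per-node "stat" file could report. */
struct mcescan_stat {
	unsigned long scanned;
	unsigned long poisoned;
};

/*
 * Stand-in for an mcsafe-read: touch one byte per cache line and copy
 * nothing.  In the kernel the load would need exception-table fixup to
 * survive a machine check; here simulated_bad plays that role.
 */
static enum probe_result probe_page(const uint8_t *page, bool simulated_bad)
{
	const volatile uint8_t *p = page;
	uint8_t sink = 0;
	size_t off;

	if (simulated_bad)
		return PROBE_POISON;

	for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_CACHE_LINE)
		sink ^= p[off];
	(void)sink;
	return PROBE_OK;
}

/*
 * One wakeup of the scanner: probe up to pages_to_scan pages starting at
 * *cursor, wrapping at the end of the node's range.  Returns the number
 * of pages probed in this batch.
 */
static unsigned int scan_batch(const struct mcescan_cfg *cfg,
			       const uint8_t *mem, const bool *bad,
			       size_t npages, size_t *cursor,
			       struct mcescan_stat *stat)
{
	unsigned int i;

	if (cfg->pages_to_scan == 0 || npages == 0)
		return 0;

	for (i = 0; i < cfg->pages_to_scan; i++) {
		size_t pfn = *cursor;

		if (probe_page(mem + pfn * SKETCH_PAGE_SIZE, bad[pfn]) == PROBE_POISON)
			stat->poisoned++;
		stat->scanned++;
		*cursor = (pfn + 1) % npages;
	}
	return i;
}
```

A real kthread would call something like scan_batch() once per wakeup, sleep
for scan_sleep_millisecs, and hand any poisoned page to the existing memory
failure handling instead of only counting it.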

Thanks,
Naoya Horiguchi