Re: [RFC] Kernel Support of Memory Error Detection.

On Thu, Nov 03, 2022 at 03:50:29PM +0000, Jiaqi Yan wrote:
> This RFC is a follow-up to [1]. We’d like to first revisit the problem
> statement, then explain the motivation for kernel support of memory
> error detection. We attempt to answer two key questions raised in the
> initial memory-scanning based solution: what memory to scan and how the
> scanner should be designed. Different from what [1] originally proposed,
> we think a kernel-driven design similar to khugepaged/kcompactd would
> work better than the userspace-driven design.
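
For concreteness, a kernel-driven design in the khugepaged/kcompactd style
essentially means a kernel thread that periodically wakes up, scans a batch
of pages, and goes back to sleep, with the sleep interval acting as the
scan-rate knob. A minimal illustrative skeleton (the name "kmemscand" and
scan_next_batch() are placeholders of mine, not from the patch set):

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/module.h>

static struct task_struct *kmemscand_task;

/* Placeholder: read the next batch of pages to surface latent UC errors. */
static void scan_next_batch(void)
{
}

static int kmemscand(void *unused)
{
        while (!kthread_should_stop()) {
                scan_next_batch();
                /* The sleep interval is effectively the configurable scan rate. */
                msleep_interruptible(1000);
        }
        return 0;
}

static int __init kmemscand_init(void)
{
        kmemscand_task = kthread_run(kmemscand, NULL, "kmemscand");
        return PTR_ERR_OR_ZERO(kmemscand_task);
}
module_init(kmemscand_init);
MODULE_LICENSE("GPL");
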
> 
> Problem Statement
> =================
> The ever-increasing DRAM size and cost have brought memory subsystem
> reliability to the forefront of large fleet owners’ concerns. Memory
> errors are one of the top hardware failures that cause server and
> workload crashes. Simply deploying extra-reliable DRAM hardware to a
> large-scale computing fleet adds significant cost, e.g., 10% extra cost
> on DRAM can amount to hundreds of millions of dollars.
> 
> Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
> during an execution context (the kernel mechanisms are MCE handler +
> CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found
> effective in keeping systems resilient to memory errors. However,
> reactive memory poison recovery has several major drawbacks:
> - It requires software systems that access poisoned memory to be
>   specifically designed and implemented to recover from memory errors.
>   Uncorrectable (UC) errors are random and may happen outside of the
>   enlightened address spaces or execution contexts. The added error
>   recovery capability comes at the cost of added complexity, and it is
>   often impossible to enlighten 3rd-party software.
> - In a virtualized environment, the injected MCEs introduce the same
>   challenge to the guest.
> - It only covers MCEs raised by CPU accesses, but the scope of the memory
>   error problem goes far beyond that. For example, PCIe devices (e.g. NICs
>   and GPUs) accessing poisoned memory can cause host crashes on certain
>   machine configurations.
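
To spell out what "enlightening" software for reactive MPR means in practice:
memory_failure() delivers SIGBUS with si_code set to BUS_MCEERR_AR or
BUS_MCEERR_AO and si_addr pointing at the poisoned region, and the
application has to catch that signal and drop or rebuild the affected data.
A rough user-space sketch (error handling trimmed, the recovery action itself
is application-specific):

#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void mce_sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
        static const char msg[] = "caught memory error via SIGBUS\n";

        /*
         * BUS_MCEERR_AR: poison consumed by this thread right now.
         * BUS_MCEERR_AO: poison found asynchronously (action optional).
         * info->si_addr is the start of the poisoned region.
         */
        if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
                write(STDERR_FILENO, msg, sizeof(msg) - 1);
                /* Application-specific recovery (drop/rebuild data) goes here. */
                _exit(1);
        }
        abort();
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = mce_sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);
        /* ... run the actual workload ... */
        return 0;
}

Retrofitting exactly this kind of handling into 3rd-party software is the
complexity problem described in the first bullet.
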
> 
> We want to upstream a patch set that proactively scans the memory DIMMs
> at a configurable rate to detect UC memory errors, and attempts to
> recover from the detected memory errors. We call it proactive MPR, which
> provides three benefits to tackle the memory error problem:
> - Proactively scanning memory DIMMs reduces the chance of a correctable
>   error becoming uncorrectable.
> - Once detected, UC errors caught in unallocated memory pages are
>   isolated and prevented from being allocated to an application or the OS.
> - The probability of software/hardware products encountering memory
>   errors is reduced, as they are only exposed to memory errors developed
>   over a window of T, where T stands for the period of scrubbing the
>   entire memory space. Any memory errors that occurred more than T ago
>   should have resulted in custom recovery actions. For example, in a cloud
>   environment VMs can be live migrated to another healthy host.
> 
> Some CPU vendors [2, 3] provide a hardware patrol scrubber (HPS) to
> prevent the build-up of memory errors. In comparison, a software memory
> error detector (SW) has the following pros and cons:
> - SW supports adaptive scanning, i.e. speeds up/down scanning, turns
>   on/off scanning, and yields its own CPU cycles and memory bandwidth.
>   All of these can happen on-the-fly based on the system workload status
>   or administrator’s choice. HPS doesn’t have all these flexibilities.
>   Its patrol speed is usually only configurable at boot time, and it is
>   not able to consider system state. (Note: HPS is a memory controller
>   feature and usually doesn’t consume CPU time).
> - SW can expose controls to scan by memory types, while HPS always scans
>   full system memory. For example, an administrator can use SW to only
>   scan hugetlb memory on the system.
> - SW can scan memory at a finer granularity, for example, using a
>   different scan rate per node, or disabling scanning entirely on some
>   nodes. HPS, however, currently only supports per-host scanning.
> - SW can make scan statistics (e.g. X bytes have been scanned over the
>   last Y seconds and Z memory errors were found) easily visible to
>   datacenter administrators, who can schedule maintenance (e.g. migrating
>   running jobs before repairing DIMMs) accordingly.

I think that exposing memory error info in the system to userspace is
useful independent of the new scanner.

> - SW’s functionality is consistent across hardware platforms. HPS’s
>   functionality varies from vendor to vendor. For example, some vendors
>   support shorter scrubbing periods than others, and some vendors may not
>   support memory scrubbing at all.
> - HPS usually doesn’t consume CPU cores but does consume memory
>   controller cycles and memory bandwidth. SW consumes both CPU cycles
>   and memory bandwidth, but this is only a problem if administrators opt
>   into the scanning after weighing the costs and benefits.
> - As CPU cores are not consumed by HPS, there won’t be any cache impact.
>   SW can utilize prefetchnta (for x86) [4] and equivalent hints for other
>   architectures [5] to minimize cache impact (in the case of prefetchnta,
>   completely avoiding L1/L2 cache impact).
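
On the prefetchnta point, the idea is roughly: touch every cache line so that
a latent UC error actually gets consumed and reported through the machine
check path, while the non-temporal hint asks the CPU not to fill the regular
caches. A user-space flavoured sketch for x86 (the 64-byte line size and the
scan_range() helper are assumptions of mine, not the proposed implementation):

#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>      /* _mm_prefetch(), _MM_HINT_NTA; x86 only */

#define CACHE_LINE_SIZE 64  /* assumed cache-line size */

static void scan_range(const volatile uint8_t *buf, size_t len)
{
        size_t off;

        for (off = 0; off < len; off += CACHE_LINE_SIZE) {
                /* Non-temporal hint: keep the scan out of the caches. */
                _mm_prefetch((const char *)(buf + off), _MM_HINT_NTA);
                /* The load is what actually consumes a poisoned line. */
                (void)buf[off];
        }
}

In the kernel, the scanner would presumably run the same access pattern
against the mapped page contents from its own thread.
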
> 
> Solution Proposals
> ==================
> 
> What to Scan
> ============
> The initial RFC proposed to scan the **entire system memory**, which
> raised the question of what memory is scannable (i.e. memory accessible
> from kernel direct mapping). We attempt to address this question by
> breaking down the memory types as follows:
> - Static memory types: memory that stays either scannable or unscannable.
>   Well-defined examples are hugetlb vs regular memory, and node-local
>   memory vs far memory (e.g. CXL or PMEM). While most static memory types
>   are scannable, administrators could disable scanning of far memory to
>   avoid interfering with the promotion and demotion logic in memory
>   tiering solutions. (The implementation will allow administrators to
>   disable scanning on scannable memory.)
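
As a side note on "scannable": for memory reachable through the kernel
mapping, the scan itself is little more than reading the page contents so
that any latent poison is consumed. A sketch of a single-page scan (purely
illustrative, not from the RFC):

#include <linux/highmem.h>
#include <linux/mm.h>

static void scan_one_page(struct page *page)
{
        /* kmap_local_page() uses the direct map where possible. */
        volatile unsigned long *p = kmap_local_page(page);
        size_t i;

        for (i = 0; i < PAGE_SIZE / sizeof(*p); i++)
                (void)p[i];             /* consume any latent UC error */

        kunmap_local((void *)p);
}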

I think that another viewpoint on how we prioritize which memory types to
scan is kernel vs userspace memory. The current hwpoison mechanism does
little to recover from errors in kernel pages (slab, reserved), so there
seems to be little benefit in detecting such errors proactively. If the
resources for scanning are limited, the user might prefer to focus on
scanning userspace memory.
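
If the scan budget should go to userspace memory first, the scanner could
filter while walking PFNs with a simple page-type check; a rough sketch
(the helper name and the exact skip policy are mine, not from the RFC):

#include <linux/memory_hotplug.h>
#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Hypothetical policy: skip kernel-owned pages that the hwpoison code
 * cannot recover anyway, and spend the scan budget on memory that can
 * be mapped to userspace.
 */
static bool page_worth_scanning(unsigned long pfn)
{
        struct page *page = pfn_to_online_page(pfn);

        if (!page)
                return false;
        if (PageReserved(page) || PageSlab(page))
                return false;
        return true;
}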

Thanks,
Naoya Horiguchi



