Ever-increasing server memory capacity and DRAM cost have brought memory subsystem reliability to the forefront of large fleet owners' concerns. Server and workload crashes caused by memory errors rank #1 among all hardware failures, by a large margin. Deploying extra-reliable DRAM adds significant cost at fleet scale, e.g., a 10% premium on DRAM can amount to hundreds of millions of dollars of spending.

"Reactive" memory poison recovery [3], i.e., recovering from MCEs raised during an execution context (the kernel mechanisms being the MCE handler + CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found effective in keeping systems resilient to memory errors. However, it has several major drawbacks:

1. It requires software systems that access poisoned memory to be specifically designed and implemented to recover from memory errors (see the SIGBUS handler sketch after the lists below):
   - Uncorrectable errors are random and may happen outside of the enlightened address spaces or execution contexts.
   - The added error recovery capability comes at the cost of added complexity, and it is often not possible to enlighten 3rd-party software.
2. In a virtualized environment, the injected MCEs introduce the same challenge to the guest.
3. Because of random execution contexts, CPU errata involving speculative execution, split cache line accesses, hyperthread buddy scheduling, etc. (e.g., [1]) can often turn a recoverable UC error into an unrecoverable MCE and a system crash.
4. Aside from CPU accesses, NICs or other PCIe devices accessing poisoned memory regularly cause host crashes in production systems.
5. In a multi-tenant environment, "reactive" poison recovery is less effective for the largest workloads than for smaller workloads, in that a smaller workload has a much higher chance of being saved cleanly as a victim's neighbor rather than being the victim itself.

The goal is to minimize the probability that any software / hardware component actually gets a chance to consume an error. A possible solution is to "proactively" look for and detect memory errors before they are consumed. Here we assume system software is enlightened to drain the affected host and migrate the running jobs off to another healthy host as soon as an error is detected, and is able to recover from, contain, and emulate the errors that surface during the migration process.

The main benefits, plus one requirement (free memory ratios come from a large production fleet):

1. Memory errors in free memory (~50%) can be completely contained without impacting software / hardware systems later on.
2. Inside a VM guest, memory errors in free guest memory (~50%) can be completely contained with the UCNA injection via CMCI capability being added to KVM in [2].
3. Memory errors on allocated pages must be detected without impacting the execution or performance of the page owners. For instance, in the cloud world, the majority of host memory is pre-allocated as a guest memory pool, and memory errors can emerge well after the guest memory pool is allocated.
4. Early detection and ensured containment (e.g., unmapping and PG_HWPOISON) can effectively prevent most if not all of the crashes due to CPU errata ([3], sections 3.5.2 - 3.5.5) that "reactive" poison recovery cannot avoid; these crashes represent >40% of all host crashes in a production fleet.
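To make drawback #1 above concrete, here is a minimal, purely illustrative sketch (not part of this proposal) of what an "enlightened" user space process has to do today under the reactive model: opt into early poison notification via PR_MCE_KILL and handle BUS_MCEERR_AR / BUS_MCEERR_AO in a SIGBUS handler. A real application additionally has to re-create or re-fetch the affected data, which is exactly the complexity that is hard to retrofit into 3rd-party software.

/*
 * Illustrative sketch of SIGBUS-based "reactive" poison recovery in
 * user space.  Error handling is omitted for brevity.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <setjmp.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

static sigjmp_buf recover_point;

static void hwpoison_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
		/*
		 * info->si_addr is the poisoned user address and
		 * (1UL << info->si_addr_lsb) the affected extent (usually
		 * one page).  A real application would invalidate or
		 * rebuild the affected object; here we just unwind.
		 */
		siglongjmp(recover_point, 1);
	}

	/* Not a memory error: fall back to the default fatal behavior. */
	signal(SIGBUS, SIG_DFL);
	raise(SIGBUS);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = hwpoison_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	/* Ask for early (action optional) notification of poisoned pages. */
	prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

	if (sigsetjmp(recover_point, 1)) {
		fprintf(stderr, "recovered from a poisoned page\n");
		return 0;
	}

	/* ... application work that may touch poisoned memory ... */
	return 0;
}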
We evaluated the hardware patrol scrubber [4]: its detection latency (hours to days between error emergence and detection) and its error loss rate due to downgrading (or, otherwise, system instability due to SRAO MCE broadcast and overflow) do not meet the requirements (e.g., ~30 minutes from error emergence to consumption, based on simulation).

A possible solution is a special-purpose userspace poison detector agent that proactively looks for memory errors by invoking an ioctl specifically implemented to avoid the CPU errata, minimize performance interference (cache pollution, etc.), and avoid leaking memory content into registers. The detector agent runs with minimal, configurable resource consumption (e.g., 0.1 core / socket, <0.5% memory bandwidth) and pauses itself when the host system is under heavy load (e.g., CPU > 90% or memory bandwidth > 75%).

The kernel ioctl may take the following form. A potential point of discussion is whether Unmapping guest Private Memory (UPM) will require zapping the kernel direct map; the ioctl can be compiled out in case it is incompatible with use cases like UPM.

/* Could stop and return after the 1st poison is detected */
#define MCESCAN_IOCTL_SCAN 0

struct SysramRegion {
	/* input */
	uint64_t first_byte;	/* first page-aligned physical address to scan */
	uint64_t length;	/* page-aligned length of the memory region to scan */

	/* output */
	uint32_t poisoned;	/* 1 - a poisoned page was found, 0 - otherwise */
	uint32_t poisoned_pfn;	/* PFN of the 1st detected poisoned page */
};

1. https://lore.kernel.org/lkml/164529415398.16921.8042682039148828519.tip-bot2@tip-bot2/
2. https://lore.kernel.org/kvm/20220412223134.1736547-1-juew@xxxxxxxxxx/
3. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/reduce-server-crash-rate-tencent-paper.pdf

-- 
2.36.0.rc2.479.g8af0fa9b8e-goog