Re: [RFC] Kernel Support of Memory Error Detection.

Jiaqi Yan <jiaqiyan@xxxxxxxxxx> · Mon, 7 Nov 2022 18:24:04 -0800

On Thu, Nov 3, 2022 at 9:40 AM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
> On Nov 3, 2022, at 9:27 AM, Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> >> - HPS usually doesn’t consume CPU cores but does consume memory
> >> controller cycles and memory bandwidth. SW consumes both CPU cycles
> >> and memory bandwidth, but is only a problem if administrators opt into
> >> the scanning after weighing the cost benefit.
> >
> > Maybe there is a middle ground on platforms that support some s/w programmable
> > DMA engine that can detect memory errors in a way that doesn't signal a
> > fatal system error. Your s/w scanner can direct that DMA engine to read from
> > the regions of memory that you want to scan, at a frequency that is compatible
> > with your system load requirements and risk assessments.
> >
> > If your idea gets traction, maybe structure the code so that it can either use
> > a CPU core to scan a block of memory, or pass requests to a platform driver that can
> > use a DMA engine to perform the scan.
>
> That’s exactly what I was about to write. :)
>
> Quickassist can be perfect for that. The IOMMU can be programmed to make the
> memory uncachable.
>

Agreed. The kernel code will hide the part that does the actual
memory scanning behind an internal "API", so that we can plug in
different scanners, e.g. a CPU core or a DMA device.

If hardware vendors make the patrol scrubber programmable in the
future, we can even direct the scanning to the patrol scrubber.
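
To make the pluggable-scanner idea concrete, here is a rough userspace
sketch of what such an internal interface could look like. It is not
taken from any patch: struct mem_scanner_ops, find_scanner(),
scan_region() and the 64-byte stride are all hypothetical names and
values chosen for illustration, and the DMA backend is only a stub
that reports failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical interface: one ops table per scanning mechanism. */
struct mem_scanner_ops {
    const char *name;
    /* Read [start, start + len); return bytes covered, or -1 on failure. */
    long (*scan)(const void *start, size_t len);
};

#define SCAN_STRIDE 64 /* assumed cache-line size, for illustration only */

/* CPU backend: touch one byte per stride so each line is actually read. */
static long cpu_scan(const void *start, size_t len)
{
    const volatile uint8_t *p = start;

    for (size_t off = 0; off < len; off += SCAN_STRIDE)
        (void)p[off]; /* volatile read, not optimized away */
    return (long)len;
}

/* DMA backend: placeholder only, no engine is driven in this sketch. */
static long dma_scan(const void *start, size_t len)
{
    (void)start;
    (void)len;
    return -1;
}

static const struct mem_scanner_ops scanners[] = {
    { "cpu", cpu_scan },
    { "dma", dma_scan },
};

/* Look up a backend by name; returns NULL if none is registered. */
const struct mem_scanner_ops *find_scanner(const char *name)
{
    for (size_t i = 0; i < sizeof(scanners) / sizeof(scanners[0]); i++)
        if (strcmp(scanners[i].name, name) == 0)
            return &scanners[i];
    return NULL;
}

/* Callers only see this entry point, never a specific backend. */
long scan_region(const struct mem_scanner_ops *ops,
                 const void *start, size_t len)
{
    assert(ops != NULL && ops->scan != NULL);
    return ops->scan(start, len);
}
```

A caller would pick a backend with find_scanner("cpu") and hand it to
scan_region(); a programmable patrol scrubber, if one ever exists,
would just be one more entry in the table.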