RE: [PATCH] acpi, nfit: skip ARS on machine-check-recovery capable platforms

"Luck, Tony" <tony.luck@xxxxxxxxx> · Thu, 9 Feb 2017 17:20:56 +0000

> Adding Tony so he can either confirm, or point and laugh at my
> assumptions. In general you're right that there are machine check
> events that are not recoverable, but I'm thinking of problems like bus
> lockups and other disasters out of the direct cpu-to-memory data path.
> The question is whether we should avoid the cpu consuming media errors
> at all costs regardless of machine-check recovery. Tony, might there be
> system-fatal gaps in memcpy_mcsafe() or userspace poison consumption
> handling that you would recommend aggressively trying to avoid media
> errors?

TL;DR - I think it is worth it ... but I worry more about errors than most
people.

In current generation systems the two most common sources of machine
checks are memory and I/O.  They dwarf all the others, like cache errors
and bus lockups.  So it is worth trying to avoid memory issues.

Whether you can recover from a machine check triggered from a CPU
read of memory depends on which instructions you use, and the alignment
of the access. That's why memcpy_mcsafe() will start with a few byte reads
if needed to align the source address, while other copy routines prefer to
align the destination. Memory writes that straddle cache lines are more
expensive than reads that do, but the point of the routine is to be
safe, so we drop a tiny amount of performance in the unaligned case to
make sure we will be able to recover.
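
To make the source-alignment idea concrete, here is a rough userspace
sketch in plain C. It is not the kernel's memcpy_mcsafe() and it does no
machine check handling at all; the names head_bytes_to_align_src() and
copy_src_aligned() are made up for illustration, and the 8-byte alignment
target is an assumption. It only shows the ordering: byte reads until the
source is aligned, then word reads from an aligned source, with the
destination left to fall wherever it falls.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: how many leading bytes must be copied one at a
 * time before src sits on an 8-byte boundary (capped at len). */
static size_t head_bytes_to_align_src(const void *src, size_t len)
{
	size_t misalign = (uintptr_t)src & 7;
	size_t head = misalign ? 8 - misalign : 0;

	return head < len ? head : len;
}

/* Hypothetical copy loop that aligns the *source*, not the destination. */
static void copy_src_aligned(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head = head_bytes_to_align_src(s, len);

	/* Byte reads until the source address is 8-byte aligned. */
	for (; head; head--, len--)
		*d++ = *s++;

	/* 8-byte reads from the now aligned source. The store to d may be
	 * unaligned, which is the cost being accepted here. */
	while (len >= 8) {
		uint64_t word;

		memcpy(&word, s, sizeof(word));
		memcpy(d, &word, sizeof(word));
		s += 8;
		d += 8;
		len -= 8;
	}

	/* Trailing bytes, again one at a time. */
	for (; len; len--)
		*d++ = *s++;
}
```

A routine that prefers to align the destination would compute the head
count from dst instead, and take the unaligned accesses on the read side.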

We can't control how userspace will access memory ... so if we can find
the errors before userspace stumbles into them it is a win.

-Tony