Re: [PATCH] acpi, nfit: skip ARS on machine-check-recovery capable platforms

Dan Williams <dan.j.williams@xxxxxxxxx> · Wed, 8 Feb 2017 15:01:53 -0800



On Wed, Feb 8, 2017 at 9:42 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Wed, Feb 8, 2017 at 7:10 AM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
>> Dan Williams <dan.j.williams@xxxxxxxxx> writes:
>>
>>> If the platform supports machine-check-recovery then there is little
>>> reason to kick off opportunistic scrubs to collect a media error list.
>>> That initial scrub is only useful when it might prevent a kernel panic
>>> from consuming poison (a media error from memory).
>>
>> How expensive is the scrub?
>
> The ACPI spec is not clear, but it could range from benign to
> expensive, degrading system performance for tens of minutes after
> boot.
>
>> Even on platforms that support recoverable
>> machine checks, it's possible that you get one that is not recoverable.
>> You haven't sold me on this change.  ;-)
>>
>
> Adding Tony so he can either confirm, or point and laugh at my
> assumptions. In general you're right that there are machine check
> events that are not recoverable, but I'm thinking of problems like bus
> lockups and other disasters out of the direct cpu-to-memory data path.
> The question is whether we should avoid the cpu consuming media errors
> at all costs regardless of machine-check recovery. Tony, might there be
> system-fatal gaps in memcpy_mcsafe() or userspace poison consumption
> handling that would lead you to recommend aggressively trying to avoid
> media errors?
>

I was able to chat with Ashok and he warned that not all instructions
that consume poison can generate a recovery point. So, thanks for
prompting the double-check. We should definitely try to collect the
badblocks list regardless of the machine-check-recovery capability of
the system.
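
To make the conclusion concrete, here is a minimal userspace sketch, not
the actual nfit or badblocks code. Every name in it (struct bad_range,
range_is_poisoned(), should_start_scrub()) is hypothetical and exists
only for illustration. It assumes a scrub has produced a list of
known-bad sector ranges, and shows the check that lets an I/O path fail
a request up front instead of letting the cpu consume poison.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one known-bad range, counted in sectors. */
struct bad_range {
	unsigned long long start;	/* first bad sector */
	unsigned long long len;		/* number of bad sectors */
};

/*
 * Return true if [sector, sector + nr) overlaps any known-bad range.
 * A caller would fail the request (e.g. with -EIO) rather than touch
 * the media when this returns true.
 */
static bool range_is_poisoned(const struct bad_range *list, size_t n,
			      unsigned long long sector,
			      unsigned long long nr)
{
	if (nr == 0)
		return false;	/* an empty request touches nothing */

	for (size_t i = 0; i < n; i++) {
		/* Two half-open intervals overlap if each starts
		 * before the other ends. */
		if (sector < list[i].start + list[i].len &&
		    list[i].start < sector + nr)
			return true;
	}
	return false;
}

/*
 * Hypothetical policy helper: machine-check recovery cannot cover
 * every poison-consuming instruction, so the capability does not
 * change the decision to collect the error list.
 */
static bool should_start_scrub(bool mce_recovery_capable)
{
	(void)mce_recovery_capable;	/* deliberately ignored */
	return true;
}
```

The point of the sketch is only that the pre-check depends on having the
list in the first place, which is why the scrub stays worthwhile even on
recovery-capable platforms.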