Re: [RFC 00/10] Introduce In Field Scan driver

On Tue, Mar 15, 2022 at 8:27 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Mar 15, 2022 at 02:59:03PM +0000, Luck, Tony wrote:
> > >> This seems a novel use of uevent ... is it OK, or is it abuse?
> > >
> > > Don't create "novel" uses of uevents.  They are there to express a
> > > change in state of a device so that userspace can then go and do
> > > something with that information.  If that pattern fits here, wonderful.
> >
> > Maybe Dan will chime in here to better explain his idea. I think for
> > the case where the core test fails, there is a good match with uevent.
> > The device (one CPU core) has changed state from "working" to
> > "untrustworthy". Userspace can do things like: take the logical CPUs
> > on that core offline, initiate a service call, or in a VMM cluster environment
> > migrate work to a different node.
>
> Again, I have no idea what you are doing at all with this driver, nor
> what you want to do with it.
>
> Start over please.
>
> What is the hardware you have to support?
>
> What is the expectation from userspace with regards to using the
> hardware?

Here is what I have learned about this driver since engaging on this
patch set. Cores go bad at run time. Datacenters can detect them at
scale. When I worked at Facebook there was an epic story of debugging
random user login failures that resulted in the discovery of a
marginal lot-number of CPUs in a certain cluster. In that case the
crypto instructions on a few cores of those CPUs gave wrong answers.
Whether that was an electromigration effect, or just a marginal bin of
CPUs, the only detection method was A-B testing different clusters of
CPUs to isolate the differences.

This driver takes advantage of a CPU feature to inject a diagnostic
test, similar to what can be done via JTAG, to validate the
functionality of a given core on a CPU at a low level. The diagnostic
is run periodically since some failures may be sensitive to thermals
while other failures may be related to the lifetime of the CPU. The
result of the diagnostic is "here are one or more cores that may
miscalculate, stop using them and replace the CPU".

At a base level the ABI need only be something that conveys "core X
failed its last diagnostic". All the other details are just extra, and
in my opinion can be dropped save for maybe "core X was unable to run
the diagnostic".

The thought process that got me from the proposal on the table ("extend
/sys/devices/system/cpu with per-cpu result state and other details")
to "emit uevents on each test completion" was the following:
- The complexity and maintenance burden of dynamically extending
/sys/devices/system/cpu: given that you identified a reference
counting issue, I wondered why this was trying to use
/sys/devices/system/cpu in the first place.

- The result of the test is an event that kicks off remediation
actions: when a test fails, a tech is paged to replace the CPU and in
the meantime the system can either be taken offline or, if some of the
cores are still good, workloads can be moved off of the bad cores to
keep some capacity online until the replacement can be made.

- KOBJ_CHANGE uevents are already deployed in NVMe for AENs
(Asynchronous Event Notifications): if the results of the test were
conveyed only in sysfs, there would have to be a program that scrapes
sysfs and turns around and fires an event for the downstream
remediation actions. A uevent cuts to the chase and lets udev rule
policy log, notify, and/or take preemptive CPU-offline action. The CPU
state has changed after a test run: it has either changed to a failed
CPU, or to one that has recently reasserted its health. (A minimal
sketch of what that emission could look like follows this list.)
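
To make that last point concrete, here is a minimal kernel-side sketch
of how a per-core result could be surfaced as a KOBJ_CHANGE uevent
with extra environment variables. The helper name and the
IFS_CORE/IFS_STATUS variable names are my own illustrative
assumptions, not anything this patch set defines:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/kobject.h>

/*
 * Hypothetical helper: report the outcome of scanning one core by
 * emitting a KOBJ_CHANGE uevent on the scan device, with the core
 * number and result attached as environment variables.
 */
static void ifs_report_result(struct device *dev, int core, const char *status)
{
	char core_str[32];
	char status_str[32];
	char *envp[] = { core_str, status_str, NULL };

	snprintf(core_str, sizeof(core_str), "IFS_CORE=%d", core);
	snprintf(status_str, sizeof(status_str), "IFS_STATUS=%s", status);

	/* status might be "pass", "fail" or "untested" */
	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
}

A udev rule could then match on those variables to log, notify, or
trigger the offline step sketched above, with no intermediate daemon
polling sysfs.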

> > > I doubt you can report "test results" via a uevent in a way that the
> > > current uevent states and messages would properly convey, but hey, maybe
> > > I'm wrong.
> >
> > But here things get a bit sketchy. Reporting "pass", or "didn't complete the test"
> > isn't a state change.  But it seems like a poor interface if there is no feedback
> > that the test was run. Using different methods to report pass/fail/incomplete
> > also seems user hostile.
>
> We have an in-kernel "test" framework.  Yes, it's for kernel code, but
> why not extend that to also include hardware tests?

This is where my head was at when starting out with this, but this is
more of an asynchronous error reporting mechanism, like machine check
or PCIe AER, than a test. The only difference is that the error in
this case is only reported after first requesting an error check. So
it is more similar to something like a background patrol scrub that
seeks out latent ECC errors in memory.


