On 3/7/22 09:46, Luck, Tony wrote:
These are software (driver) defined error codes. The rest of the error codes are supplied by
the hardware. Software defined error codes were kept at the other end to provide ample space
in case (future) hardware decides to provide extended error codes.
Why put them in the same number space? Separate software results from
the raw hardware results and have a separate mechanism to convey each.
We wanted to include them in the "details" file, which is otherwise a direct copy of
the SCAN_STATUS MSR. Making sure the software error codes didn't overlap
with any h/w generated codes seemed like a good idea.
But maybe we should have done this with additional string values in the status
file:
Current:
pass
untested
fail
Add a couple of new options for the s/w cases:
sw_timeout
sw_retries_exceeded
We've made a userspace implementation for this API already as part of
opendcdiag that uses it:
https://github.com/opendcdiag/opendcdiag/commit/0cbfcee30e0666b0f79a2e452d7f8167d2a0cb90
What I really like is that with this proposed API, we can unambiguously
determine whether "the core failed" or "everything is fine, for now" by
reading a single file. I hate to see this file become unusable because
its content changes from "pass" to "sw_timeout" or, even worse, it
changes from "fail" to "sw_timeout". That would render it useless for
the purpose for which I think our users will be looking at it.
So, my preference would be to keep this file functioning as-is in this
patch series.
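To illustrate the concern, here is a rough sketch (in Python for brevity) of a consumer that only knows the three current values. The function name and the "unknown" fallback are made up for illustration and are not taken from the patch series or from opendcdiag:

```python
# Hypothetical consumer of the "status" file. Only the three values
# listed above (pass, untested, fail) come from this thread; the
# function name and the "unknown" fallback are assumptions.
KNOWN_STATUS = {"pass", "untested", "fail"}


def classify_status(text: str) -> str:
    """Map the file content to one of the known states, else "unknown"."""
    value = text.strip()
    if value in KNOWN_STATUS:
        return value
    # A value such as "sw_timeout" lands here: an existing tool can no
    # longer tell whether the core failed or is fine for now.
    return "unknown"
```

With the file content passed in as a string, `classify_status("fail\n")` gives "fail", while `classify_status("sw_timeout\n")` falls through to "unknown".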
I would think that some sort of expandable "statistics" file would be a
better way to output various metrics:
```
sw_timeout: 0
sw_retries_exceeded: 2
runs: 42
first_run: 1405529347
last_run: 1646948140
<etc..>
```
just as a suggested alternative for more/incompatible output values or a
complex, dynamic format.
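A "key: value" layout like the one suggested above would be cheap for tools to consume and could grow without breaking them. A minimal sketch, assuming one integer counter per line (the function name is hypothetical, and no such file exists in the patch series):

```python
def parse_stats(text: str) -> dict:
    """Parse "key: value" lines into a dict of integer counters.

    Lines that do not match the expected shape are skipped, so new or
    unrecognized entries do not break older readers.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # not a "key: <non-negative integer>" line
        stats[key.strip()] = int(value)
    return stats
```

A reader written this way picks out the keys it understands and ignores the rest, which is what makes the format expandable.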
I don't have any use in opendcdiag for these values and data. If someone
does, they may want to chime in.
Auke