Re: [RFC 07/10] platform/x86/intel/ifs: Create kthreads for online cpus for scan test

"Joseph, Jithu" <jithu.joseph@xxxxxxxxx> · Fri, 4 Mar 2022 11:20:10 -0800

On 3/2/2022 8:17 PM, Williams, Dan J wrote:
> On Tue, 2022-03-01 at 11:54 -0800, Jithu Joseph wrote:
>> Create scan kthreads for online logical cpus. Once scan test is triggered,
>> it wakes up the corresponding thread and its sibling threads to execute
>> the test. Once the scan test is done, the threads go back to thread wait
>> for next signal to start a new scan.
>>
...
>> +
>> +static int retry = 5;
>> +module_param_cb(retry, &ifs_retry_ops, &retry, 0644);
>> +
>> +MODULE_PARM_DESC(retry, "Maximum retry count when the test is not executed");
> 
> Why does this retry need to be in the kernel? Can't the test runner
> retry the test if they want? If it stays in the kernel, why a module
> parameter and not a sysfs attribute?

A core is tested by writing 1 to "runtest" file.  When user writes  a 1 to run_test file it tests all the subtests (chunks) on a core. 
Distinct from this, retry parameter describes the autoretry driver would do at  "chunk" granularity (bit more on why, is available in the doc)

Why not a sysfs attribute: good qn,  Our earlier prototype had this as a percpu sysfs attribute, however this was removed to keep the sysfs entries simple/minimal and less confusing.
(and tunable options which are not strictly needed in the normal course of use were moved to module parameters)
In the percpu sysfs we now only have the essential stuff i.e  run_test , status , and details making it simpler for user who wants to test the core.

> 
>> +
>> +static bool noint = 1;
>> +module_param(noint, bool, 0644);
>> +MODULE_PARM_DESC(noint, "Option to enable/disable interrupt during test");
> 
> Same sysfs vs module parameter question...

Same as above

> 

>> +
>> +static const char * const scan_test_status[] = {
>> +       "SCAN no error",
>> +       "Other thread could not join.",
>> +       "Interrupt occurred prior to SCAN coordination.",
>> +       "Core Abort SCAN Response due to power management condition.",
>> +       "Non valid chunks in the range",
>> +       "Mismatch in arguments between threads T0/T1.",
>> +       "Core not capable of performing SCAN currently",
>> +       "Unassigned error code 0x7",
>> +       "Exceeded number of Logical Processors (LP) allowed to run Scan-At-Field concurrently",
>> +       "Interrupt occurred prior to SCAN start",
> 
> This looks unmaintainable...
> 
> /me finds large comment block around IFS_* error codes and suggests
> killing 2 birds with one stone, i.e. delete that comment and make this
> self documenting:
> 
> static const char * const scan_test_status[] = {
>     [IFS_NO_ERROR] = "SCAN no error",
>     [IFS_OTHER_THREAD_DID_NOT_JOIN] = "Other thread could not join.",
>     ...etc...
> };

Will use this format.

> 
> Btw, which is it "did not join" and "could not join"? If the symbol
> name is going to be that long might as well make it match the log
> message verbatim.

Will make them identical. Thanks for pointing this out.

> 
> That way you can add / delete error codes without wondering if you
> managed to match the code number to the right position in the array.
> 
> Even better would be to kick this out of the kernel and let the user
> tool translate the error codes to test result log messages.
> 
>> +};
>> +
>> +static void message_not_tested(int cpu, union ifs_status status)
>> +{
>> +       if (status.error_code < ARRAY_SIZE(scan_test_status))
>> +               pr_warn("CPU %d: SCAN operation did not start. %s\n", cpu,
>> +                       scan_test_status[status.error_code]);
>> +       else if (status.error_code == IFS_SW_TIMEOUT)
>> +               pr_warn("CPU %d: software timeout during scan\n", cpu);
>> +       else if (status.error_code == IFS_SW_PARTIAL_COMPLETION)
>> +               pr_warn("CPU %d: %s\n", cpu,
>> +                       "Not all scan chunks were executed. Maximum forward progress retries exceeded");
> 
> Why are these codes not in the scan_test_status set? I see that
> IFS_SW_PARTIAL_COMPLETION and IFS_SW_TIMEOUT are defined with larger
> values, but why?

These are software(driver) defined error codes. Rest of the error codes are supplied by
the hardware. Software defined error codes were kept at the other end to provide ample space
in case (future) hardware decides to provide extend error codes.

> 
>> +       else
>> +               pr_warn("CPU %d: SCAN unknown status %llx\n", cpu, status.data);
>> +}
>> +
>> +static void message_fail(int cpu, union ifs_status status)
>> +{
>> +       if (status.control_error) {
>> +               pr_err("CPU %d: scan failed. %s\n", cpu,
>> +                      "Suggest reload scan file: # echo 1 > /sys/devices/system/cpu/ifs/reload");
>> +       }
>> +       if (status.signature_error) {
>> +               pr_err("CPU %d: test signature incorrect. %s\n", cpu,
>> +                      "Suggest retry scan to check if problem is transient");
>> +       }
> 
> This looks and feels more like tools/testing/selftests/ style code
> printing information for a kernel developer to read. For a production
> capability I would expect these debug messages to be elided and have an
> ifs user tool that knows when to "Suggest reload scan file". Otherwise,
> it's not scalable to use the kernel log buffer like a utility's stdout.

The two pr_err here are for really really grave errors and warrants being displayed to console,
possibly indicating some fault with the particular core. They are never expected to come in normal
course of testing on a working core.

But I see your larger point. We will convert all the pr_warn preceeding this (in message_not_tested())
to pr_dbg so that they dont normally take up the kernel log buffer. (they are not as grave a scenario as the earlier one).

The same information is also available from percpu sysfs/cpu#/ifs/status for user spaces tools to operate on

Jithu