On Fri, Mar 4, 2022 at 11:20 AM Joseph, Jithu <jithu.joseph@xxxxxxxxx> wrote: > > > > On 3/2/2022 8:17 PM, Williams, Dan J wrote: > > On Tue, 2022-03-01 at 11:54 -0800, Jithu Joseph wrote: > >> Create scan kthreads for online logical cpus. Once scan test is triggered, > >> it wakes up the corresponding thread and its sibling threads to execute > >> the test. Once the scan test is done, the threads go back to thread wait > >> for next signal to start a new scan. > >> > ... > >> + > >> +static int retry = 5; > >> +module_param_cb(retry, &ifs_retry_ops, &retry, 0644); > >> + > >> +MODULE_PARM_DESC(retry, "Maximum retry count when the test is not executed"); > > > > Why does this retry need to be in the kernel? Can't the test runner > > retry the test if they want? If it stays in the kernel, why a module > > parameter and not a sysfs attribute? > > A core is tested by writing 1 to "runtest" file. When user writes a 1 to run_test file it tests all the subtests (chunks) on a core. > Distinct from this, retry parameter describes the autoretry driver would do at "chunk" granularity (bit more on why, is available in the doc) > > Why not a sysfs attribute: good qn, Our earlier prototype had this as a percpu sysfs attribute, however this was removed to keep the sysfs entries simple/minimal and less confusing. > (and tunable options which are not strictly needed in the normal course of use were moved to module parameters) > In the percpu sysfs we now only have the essential stuff i.e run_test , status , and details making it simpler for user who wants to test the core. ...but you are putting it in sysfs, just in a different directory: /sys/module/ifs/parameters vs /sys/devices/{system/cpu/,platform}/ifs Just unify it all in one place, otherwise, I fail to see the simplification for the user over spreading settings over multiple locations. > > > > > >> + > >> +static bool noint = 1; > >> +module_param(noint, bool, 0644); > >> +MODULE_PARM_DESC(noint, "Option to enable/disable interrupt during test"); > > > > Same sysfs vs module parameter question... > > Same as above Same multiple sysfs ABI location concern. > > > > > >> + > >> +static const char * const scan_test_status[] = { > >> + "SCAN no error", > >> + "Other thread could not join.", > >> + "Interrupt occurred prior to SCAN coordination.", > >> + "Core Abort SCAN Response due to power management condition.", > >> + "Non valid chunks in the range", > >> + "Mismatch in arguments between threads T0/T1.", > >> + "Core not capable of performing SCAN currently", > >> + "Unassigned error code 0x7", > >> + "Exceeded number of Logical Processors (LP) allowed to run Scan-At-Field concurrently", > >> + "Interrupt occurred prior to SCAN start", > > > > This looks unmaintainable... > > > > /me finds large comment block around IFS_* error codes and suggests > > killing 2 birds with one stone, i.e. delete that comment and make this > > self documenting: > > > > static const char * const scan_test_status[] = { > > [IFS_NO_ERROR] = "SCAN no error", > > [IFS_OTHER_THREAD_DID_NOT_JOIN] = "Other thread could not join.", > > ...etc... > > }; > > Will use this format. > > > > > Btw, which is it "did not join" and "could not join"? If the symbol > > name is going to be that long might as well make it match the log > > message verbatim. > > Will make them identical. Thanks for pointing this out. > > > > > That way you can add / delete error codes without wondering if you > > managed to match the code number to the right position in the array. > > > > Even better would be to kick this out of the kernel and let the user > > tool translate the error codes to test result log messages. > > > >> +}; > >> + > >> +static void message_not_tested(int cpu, union ifs_status status) > >> +{ > >> + if (status.error_code < ARRAY_SIZE(scan_test_status)) > >> + pr_warn("CPU %d: SCAN operation did not start. %s\n", cpu, > >> + scan_test_status[status.error_code]); > >> + else if (status.error_code == IFS_SW_TIMEOUT) > >> + pr_warn("CPU %d: software timeout during scan\n", cpu); > >> + else if (status.error_code == IFS_SW_PARTIAL_COMPLETION) > >> + pr_warn("CPU %d: %s\n", cpu, > >> + "Not all scan chunks were executed. Maximum forward progress retries exceeded"); > > > > Why are these codes not in the scan_test_status set? I see that > > IFS_SW_PARTIAL_COMPLETION and IFS_SW_TIMEOUT are defined with larger > > values, but why? > > These are software(driver) defined error codes. Rest of the error codes are supplied by > the hardware. Software defined error codes were kept at the other end to provide ample space > in case (future) hardware decides to provide extend error codes. Why put them in the same number space? Separate software results from the raw hardware results and have a separate mechanism to convey each. > > > > >> + else > >> + pr_warn("CPU %d: SCAN unknown status %llx\n", cpu, status.data); > >> +} > >> + > >> +static void message_fail(int cpu, union ifs_status status) > >> +{ > >> + if (status.control_error) { > >> + pr_err("CPU %d: scan failed. %s\n", cpu, > >> + "Suggest reload scan file: # echo 1 > /sys/devices/system/cpu/ifs/reload"); > >> + } > >> + if (status.signature_error) { > >> + pr_err("CPU %d: test signature incorrect. %s\n", cpu, > >> + "Suggest retry scan to check if problem is transient"); > >> + } > > > > This looks and feels more like tools/testing/selftests/ style code > > printing information for a kernel developer to read. For a production > > capability I would expect these debug messages to be elided and have an > > ifs user tool that knows when to "Suggest reload scan file". Otherwise, > > it's not scalable to use the kernel log buffer like a utility's stdout. > > The two pr_err here are for really really grave errors and warrants being displayed to console, > possibly indicating some fault with the particular core. They are never expected to come in normal > course of testing on a working core. Kernel log messages with user action recommendations belong in a user tool. > But I see your larger point. We will convert all the pr_warn preceeding this (in message_not_tested()) > to pr_dbg so that they dont normally take up the kernel log buffer. (they are not as grave a scenario as the earlier one). > > The same information is also available from percpu sysfs/cpu#/ifs/status for user spaces tools to operate on "Suggest retry" does not seem like "grave error" to me. User feedback belongs in a user tool.