On Thu, May 05 2022 at 18:40, Tony Luck wrote: > +/* > + * Note all code and data in this file is protected by > + * ifs_sem. On HT systems all threads on a core will > + * execute together, but only the first thread on the > + * core will update results of the test. > + */ > +struct workqueue_struct *ifs_wq; Seems to be unused. > +static bool oscan_enabled = true; What changes this? > +static void message_not_tested(struct device *dev, int cpu, union ifs_status status) > +{ > + if (status.error_code < ARRAY_SIZE(scan_test_status)) Please add curly brackets as these are not one-line statements. > + dev_info(dev, "CPU(s) %*pbl: SCAN operation did not start. %s\n", > + cpumask_pr_args(topology_sibling_cpumask(cpu)), > + scan_test_status[status.error_code]); > +/* > + * Execute the scan. Called "simultaneously" on all threads of a core > + * at high priority using the stop_cpus mechanism. > + */ > +static int doscan(void *data) > +{ > + int cpu = smp_processor_id(); > + u64 *msrs = data; > + int first; > + > + /* Only the first logical CPU on a core reports result */ > + first = cpumask_first(topology_sibling_cpumask(cpu)); Shouldn't that be cpu_smt_mask()? > + /* > + * This WRMSR will wait for other HT threads to also write > + * to this MSR (at most for activate.delay cycles). Then it > + * starts scan of each requested chunk. The core scan happens > + * during the "execution" of the WRMSR. This instruction can > + * take up to 200 milliseconds before it retires. 200ms per test chunk? > + */ > + wrmsrl(MSR_ACTIVATE_SCAN, msrs[0]); > + > + while (activate.start <= activate.stop) { > + if (time_after(jiffies, timeout)) { > + status.error_code = IFS_SW_TIMEOUT; > + break; > + } > + > + msrvals[0] = activate.data; > + stop_core_cpuslocked(cpu, doscan, msrvals); > + > + status.data = msrvals[1]; > + > + /* Some cases can be retried, give up for others */ > + if (!can_restart(status)) > + break; > + > + if (status.chunk_num == activate.start) { > + /* Check for forward progress */ > + if (retries-- == 0) { > + if (status.error_code == IFS_NO_ERROR) > + status.error_code = IFS_SW_PARTIAL_COMPLETION; > + break; > + } > + } else { > + retries = MAX_IFS_RETRIES; > + activate.start = status.chunk_num; > + } > + } Looks way better now. > +} > +/* > + * Initiate per core test. It wakes up work queue threads on the target cpu and > + * its sibling cpu. Once all sibling threads wake up, the scan test gets executed and > + * wait for all sibling threads to finish the scan test. > + */ > +int do_core_test(int cpu, struct device *dev) > +{ > + int ret = 0; > + > + if (!scan_enabled) > + return -ENXIO; > + > + /* Prevent CPUs from being taken offline during the scan test */ > + cpus_read_lock(); > + > + if (!cpu_online(cpu)) { > + dev_info(dev, "cannot test on the offline cpu %d\n", cpu); > + ret = -EINVAL; > + goto out; > + } Coming back to my points from the previous round: 1) How is that supposed to work on a system which has HT enabled in BIOS, but disabled on the kernel command line or via /sys/..../smt/control or when a HT sibling is offlined temporarily? I assume it cannot work, but I can't see anything which handles those cases. 2) That documentation for the admin/user got eaten by the gremlins in the intertubes again. Thanks, tglx