Hi Maciej,

On 2/2/2024 2:17 AM, Maciej Wieczor-Retman wrote:
> On 2024-02-01 at 11:47:44 -0800, Reinette Chatre wrote:
>> Hi Maciej,
>>
>> On 1/31/2024 4:55 AM, Maciej Wieczor-Retman wrote:
>>> On 2024-01-26 at 13:10:18 -0800, Reinette Chatre wrote:
>>>> On 1/25/2024 3:13 AM, Maciej Wieczor-Retman wrote:
>>>>> +	if (sparse_masks != ((ecx >> 3) & 1)) {
>>>>> +		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
>>>>> +		return -1;
>>>>
>>>> If I understand correctly this falls into the "test failure" [1] category
>>>> and should return 1? ...
>>>>
>>>>> +	}
>>>>> +
>>>>> +	/* Write checks initialization. */
>>>>> +	ret = get_full_cbm(test->resource, &full_cache_mask);
>>>>> +	if (ret < 0)
>>>>> +		return ret;
>>>>> +	bit_center = count_bits(full_cache_mask) / 2;
>>>>> +	cont_mask = full_cache_mask >> bit_center;
>>>>> +
>>>>> +	/* Contiguous mask write check. */
>>>>> +	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
>>>>> +	ret = write_schemata("", schemata, uparams->cpu, test->resource);
>>>>> +	if (ret) {
>>>>> +		ksft_print_msg("Write of contiguous CBM failed\n");
>>>>> +		return ret;
>>>>
>>>> ... although here I think the goal to distinguish between test error and test failure
>>>> falls apart since it is not possible to tell within the test if the failure is
>>>> because of an error in the test or if the test failed.
>>>
>>> Is there even a distinction between test error and failure in resctrl selftest?
>>
>> There is such a distinction in the current tests (and from what I understand the reason
>> behind the logical XOR used in this test). In existing tests the running of
>> the test precedes and is clearly separate from determining the test pass/fail.
>> All the current tests have a clear "run the test" phase where data is collected to
>> a file, followed by an analysis (aka "check results") phase that looks at collected
>> data to determine if the test passes or fails.
>> Note how all the "check results" return either 0 or 1 to indicate test pass
>> or fail respectively. Specifically, you can refer to:
>> mbm_test.c->check_results()
>> mba_test.c->check_results()
>> cmt_test.c->check_results()
>> cat_test.c->check_results()
>>
>>> I've been looking at it for a while and can't find any instances where
>>> ksft_test_result_error() would be used. Everywhere I look it's either pass or
>>> fail. By grep-ing over all selftests I found only five tests that use
>>> ksft_test_result_error().
>>
>> Yes, from the user perspective there is no such distinction. This seems to
>> be entirely internal to the resctrl selftests (but I do not think that this
>> should or can be a hard requirement).
>
> Okay, thank you, that's what I wanted to know.
>
>>
>>>
>>> Furthermore there is this one "TODO" in kselftests.h:
>>>
>>> /* TODO: how does "error" differ from "fail" or "skip"? */
>>>
>>> If you meant the distinction less literally then I'd say the sparse_masks
>>> comparison to CPUID would be a failure. What I had in mind is that it tries to
>>> validate a resctrl interface relevant to non-contiguous CBMs. If it fails
>>> there is probably something wrong with the code concerning non-contiguous CBMs.
>>
>> Wrong with which code? As I understand it this particular check compares the
>> resctrl view of the world to the hardware realities. If this check fails
>> then I do not think this is an issue with the test code (which would make it a test
>> error) but instead a resctrl bug and thus a test failure.
>
> I also meant a resctrl bug. I was thinking about the kernel resctrl code that
> handles taking the CPUID information about non-contiguous CBMs and putting it in
> the sparse_masks file.
>
> If there was a hardware problem and CPUID returned wrong information, then the
> check wouldn't fail as sparse_masks relies on CPUID too and both values would
> match. So in view of this I thought that this check could make sure that the
> resctrl kernel code handles the CPUID returned information properly.
>
> So should this check be moved from the "run the test" phase to the end of the
> function ("check results" phase) to signify that it's not an error but a
> failure?

I do not think this test matches the "run" and "check" phases of previous tests,
unless you create a new test for every scenario checked within this test. Just
returning 1 when the check (if (sparse_masks != ((ecx >> 3) & 1))) fails should
be ok, no?
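That is, something along these lines (just a sketch on top of the snippet
quoted above, not tested):

	/*
	 * resctrl's sparse_masks not matching what CPUID reports points at
	 * a resctrl bug, so report it as a test failure (1), not a test
	 * error (-1).
	 */
	if (sparse_masks != ((ecx >> 3) & 1)) {
		ksft_print_msg("CPUID output doesn't match 'sparse_masks' file content!\n");
		return 1;
	}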
>>> On the other hand writing contiguous CBMs shouldn't fail as far as the
>>> non-contiguous CBMs in CAT test is concerned. So if that fails there might be
>>> something wrong on a higher level and I'd say that can be more of an error than
>>> a failure.
>>
>> I think that write_schemata() can fail for a variety of reasons, some may
>> indicate an issue with the test while some may indicate an issue with resctrl.
>> It is not possible for the caller of write_schemata() to distinguish.
>>
>>> But I'm just saying how I understood it so far. If there is some clear
>>> distinction between error and failure definitions I could try to separate it
>>> more explicitly.
>>
>> I do not think it is possible to clearly distinguish between error and failure.
>> These are already lumped together as a ksft_test_result_fail() anyway so no
>> risk of confusion to folks just running the tests.
>> I think the final test result may be confusing to folks parsing the
>> resctrl selftest internals:
>>
>> run_single_test()
>> {
>> 	...
>> 	ret = test->run_test(test, uparams);
>> 	ksft_test_result(!ret, "%s: test\n", test->name);
>> 	...
>> }
>>
>> The above means that a test returning a negative or greater than zero value is
>> considered a test failure, and resctrl tests may return either in the case of
>> an actual test failure ... but from the user perspective there is no difference
>> so I do not think it is an issue, just a lack of consistency in the resctrl
>> test internals in cases like a write_schemata() failure where a possible
>> test fail is captured as a test error.
>>
>> I do not think it is required to be strict here. Keeping "test returns
>> negative or greater than zero on test failure" seems reasonable to me.
>
> Okay, so the approach I applied in noncont_cat_run_test() with write_schemata()
> is acceptable?

In general I'd say a write_schemata() failure's return code will be acceptable,
but you should be consistent in this test. There are two write_schemata() calls
in this test, one treats an error return as a failure and the other treats an
error return as an error. Considering this inconsistency I would thus rather
suggest that you always treat a write_schemata() error return as a test failure.
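For example (again only a sketch based on the snippet quoted earlier; the
second, non-contiguous write_schemata() call is not quoted here but would get
the same treatment):

	/* Contiguous mask write check. */
	snprintf(schemata, sizeof(schemata), "%lx", cont_mask);
	ret = write_schemata("", schemata, uparams->cpu, test->resource);
	if (ret) {
		ksft_print_msg("Write of contiguous CBM failed\n");
		/* Report any write_schemata() error as a test failure. */
		return 1;
	}

Reinette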