On Fri, 6 Sep 2024, Reinette Chatre wrote:
> On 9/6/24 3:00 AM, Ilpo Järvinen wrote:
> > On Thu, 5 Sep 2024, Reinette Chatre wrote:
> > > On 9/5/24 5:10 AM, Ilpo Järvinen wrote:
> > > > On Wed, 4 Sep 2024, Reinette Chatre wrote:
> > > > > On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
> > > > > > On Fri, 30 Aug 2024, Reinette Chatre wrote:
> > > > > > > On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
> > > > > > > > On Thu, 29 Aug 2024, Reinette Chatre wrote:
> > > > > > > > > @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test,
> > > > > > > > >  		return ret;
> > > > > > > > >  	}
> > > > > > > > > -	/*
> > > > > > > > > -	 * If benchmark wasn't successfully started by child, then child should
> > > > > > > > > -	 * kill parent, so save parent's pid
> > > > > > > > > -	 */
> > > > > > > > >  	ppid = getpid();
> > > > > > > > > -	if (pipe(pipefd)) {
> > > > > > > > > -		ksft_perror("Unable to create pipe");
> > > > > > > > > +	/* Taskset test to specified CPU. */
> > > > > > > > > +	ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
> > > > > > > >
> > > > > > > > Previously only CPU affinity for bm_pid was set but now it's set
> > > > > > > > before fork(). Quickly checking the Internet, it seems that CPU
> > > > > > > > affinity gets inherited on fork() so now both processes will have
> > > > > > > > the same affinity, which might make the other process interfere
> > > > > > > > with the measurement.
> > > > > > >
> > > > > > > Setting the affinity is intended to ensure that the buffer preparation
> > > > > > > occurs in the same topology as where the runtime portion will run.
> > > > > > > This preparation is done before the work to be measured starts.
> > > > > > >
> > > > > > > This does tie in with the association with the resctrl group and I
> > > > > > > will elaborate more below ...
> > > > > >
> > > > > > Okay, that's useful to retain, but thinking about this further, now
> > > > > > we're also going to do a non-trivial amount of work in between the
> > > > > > setup and the test by
> > > > >
> > > > > Could you please elaborate how the amount of work during setup can be
> > > > > an issue? I have been focused on the measurements that are done
> > > > > afterwards that do have clear boundaries from what I can tell.
> > > >
> > > > Well, you used it as a justification: "Setting the affinity is intended
> > > > to ensure that the buffer preparation occurs in the same topology as
> > > > where the runtime portion will run." So I assumed you had some
> > > > expectations about "preparations" done outside of those "clear
> > > > boundaries", but now you seem to take an entirely opposite stance?
> > >
> > > I do not follow you here. With the "clear boundaries" I meant the
> > > measurements and associated variables that have a clear start/init and
> > > stop/read while the other task sleeps. Yes, preparations are done outside
> > > of this since that should not be measured.
> >
> > Of course the preparations are not measured (at least not after this
> > patch :-)).
> >
> > But that's not what I meant. You said the preparations are to be done
> > using the same topology as the test, but if it's outside of the
> > measurement period, the topology during preparations only matters if you
> > make some hidden assumption that some state carries from preparations to
> > the actual test. If no state carry-over is assumed, the topology during
> > preparations is not important. So which way is it? Is the topology during
> > preparations important or not?
>
> Topology during preparations is important.
>
> In the CMT test this is more relevant with the transition to using
> memflush = false. The preparation phase and measure phase use the same
> cache alloc configuration and, just as importantly, the same monitoring
> configuration. During preparation the cache portion that will be used
> during measurement will be filled with the contents of the buffer. During
> measurement it will be the same cache portion into which any new reads
> will be allocated and measured. In fact, the preparation phase will thus
> form part of the measurement. If preparation was done with a different
> configuration, then I see a problem with memflush = true as well as with
> memflush = false. In the case of memflush = true it will have the familiar
> issue of the test needing to "guess" the workload settle time. In the case
> of memflush = false the buffer will remain allocated into the cache
> portion used during the preparation phase; when the workload runs it will
> read the data from a cache portion that does not belong to it, and since
> it does not need to allocate into its own cache portion its LLC occupancy
> counts will not be accurate (the LLC occupancy associated with the buffer
> will be attributed to the prepare portion).
>
> I am not familiar with the details of memory allocation, but as I
> understand it, Linux does attempt to satisfy memory requests from the node
> to which the requesting CPU is assigned. For the MBM and MBA tests I thus
> believe it is important to allocate the memory from where it will be used.
> I have encountered systems where CPU0 and CPU1 are on different sockets
> and by default the workload is set to run on CPU1. If the preparation
> phase runs on CPU0 then it may be that memory could be allocated from a
> different NUMA node than where the workload will be running. By doing
> preparation within the same topology as what the workload will be running
> on, I believe that memory will be allocated appropriately for the workload
> and thus result in more reliable measurements.
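Okay, so the preparations are expected to leave state behind for the
measured phase: the cache fill in the CMT case and the first-touch page
placement in the MBM/MBA case. For the archive, this is roughly the
ordering I understand the test to rely on after this patch -- a minimal
standalone sketch only (untested; the CPU number and buffer size are made
up and this is of course not the selftest code itself):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	size_t span = 256 * 1024 * 1024;	/* placeholder buffer size */
	int cpu = argc > 1 ? atoi(argv[1]) : 1;	/* target CPU, like uparams->cpu */
	cpu_set_t mask;
	char *buf;

	/* Pin before preparing, like taskset_benchmark() before fork(). */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask))
		return 1;

	/* Preparation: first touch places pages on the node local to 'cpu'. */
	buf = malloc(span);
	if (!buf)
		return 1;
	memset(buf, 0, span);

	if (fork() == 0) {
		cpu_set_t child_mask;

		/* The affinity mask set above is inherited across fork(). */
		sched_getaffinity(0, sizeof(child_mask), &child_mask);
		printf("child still pinned to CPU %d: %d\n", cpu,
		       CPU_ISSET(cpu, &child_mask) ? 1 : 0);
		/* ... the measured workload would read 'buf' here ... */
		_exit(0);
	}
	wait(NULL);
	free(buf);
	return 0;
}

In other words, both the page placement and the inherited affinity follow
from the sched_setaffinity() call being done before fork().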
> > You used the topology argument to justify why both parent and child are
> > now in the same resctrl group, unlike before when only the actual test
> > was.
> >
> > > You stated "now we're also going to do a non-trivial amount of work in
> > > between the setup and the test" ... could you clarify what the problem
> > > is with this? Before this work the "non-trivial amount of work" (for
> > > "fill_buf") was done as part of the test with (wrong) guesses about how
> > > long it takes. This work aims to improve this.
> >
> > I understand what you're trying to do with this change.
> >
> > However, I was trying to say that before, preparations occurred right
> > before the test, which is no longer the case because there's fork() in
> > between.
>
> If by "test" you mean the measurement phase, then in the case of
> "fill_buf" preparations only now reliably occur before the measurement.
> Original behavior is maintained with a user-provided benchmark.
>
> > Does that extra work impact the state carry-over from preparations to
> > the test?
>
> It is not clear to me what extra work or state you are referring to.
>
> > > > fork() is quite a heavy operation as it has to copy various things,
> > > > including the address space which you just made contain a huge mem
> > > > blob. :-)
> > >
> > > As highlighted in a comment found in the patch, the kernel uses
> > > copy-on-write and all tests only read from the buffer after fork().
> >
> > A write test is possible using -b fill_buf ... as mentioned in comments
> > to the other patch.
>
> Yes, it is theoretically possible, but looking closer it is not supported.
> Note how measure_mem_bw() is always called with hardcoded "reads". Reading
> through the history of commits I do not believe modifying fill_buf
> parameters was ever intended to be a use case for the "-b" parameter. It
> seems intended to provide an alternate benchmark to fill_buf.
>
> I am not interested in adding support for the write test because I do not
> see how it helps us to improve resctrl subsystem health. Considering that
> nobody has ever complained about the write test being broken, I think all
> that dead code should just be removed instead.
>
> I prefer that the focus be on the health of the resctrl subsystem instead
> of additional hardware performance testing. I do not think it is energy
> well spent to further tune these performance tests unless it benefits
> resctrl subsystem health. These performance tests are difficult to
> maintain and I have not yet seen how they benefit the resctrl subsystem.
>
> > > > BTW, perhaps we could use some lighter-weight fork variant in the
> > > > resctrl selftests, the processes don't really need to be that
> > > > separated to justify using full fork() (and custom benchmarks will do
> > > > execvp()).
> > >
> > > Are you thinking about pthreads? It is not clear to me that this is
> > > necessary. It is also not clear to me what problem you are describing
> > > that needs this solution. When I understand that better it will be
> > > easier to discuss solutions.
> >
> > I was trying to say I see an advantage in retaining the address space,
> > which is not possible with fork(), so perhaps using clone() with CLONE_VM
> > would be useful and maybe some other flags too. I think this would solve
> > the write test case.
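To be concrete about what I meant here: something along the lines of the
sketch below -- rough and untested, the stack size and the benchmark_child
name are invented, and real code sharing the address space like this would
need more care (e.g. around libc state), so this is only to illustrate the
shape of it:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE	(1024 * 1024)

static char *shared_buf;			/* shared via CLONE_VM */
static size_t span = 256 * 1024 * 1024;		/* placeholder buffer size */

static int benchmark_child(void *arg)
{
	size_t i;

	(void)arg;
	/* With CLONE_VM both reads and writes hit the parent's buffer. */
	for (i = 0; i < span; i += 4096)
		shared_buf[i]++;
	return 0;
}

int main(void)
{
	char *stack;
	pid_t pid;

	shared_buf = malloc(span);
	if (!shared_buf)
		return 1;
	memset(shared_buf, 0, span);		/* "preparation" */

	/* The extra bookkeeping: the child needs its own stack. */
	stack = malloc(CHILD_STACK_SIZE);
	if (!stack)
		return 1;

	/* clone() wants the top of the stack on stack-grows-down arches. */
	pid = clone(benchmark_child, stack + CHILD_STACK_SIZE,
		    CLONE_VM | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);

	free(stack);
	free(shared_buf);
	return 0;
}

Whether managing the child's stack is worth the trouble is of course a
fair question.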
> clone() brings with it the complexity of needing to manage the child's
> stack. This again aims for a performance improvement. What does this fix?
> What is the benefit to resctrl health? I would prefer that our energy
> instead be focused on improving resctrl subsystem health.

Fair enough.

-- 
 i.