On Wed, Dec 06, 2023 at 08:50:23AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
> >
> > From a complexity standpoint, it is exactly as complex as the hardware
> > configuration itself: each socket has a different view of the memory
> > topology.  If you have a non-homogeneous memory configuration (e.g. a
> > different number of CXL expanders on one socket than the other), a flat
> > array of weights has no way of capturing this hardware configuration.
>
> One important task of the software is to hide the complexity of hardware
> from the users.  At least it should provide the option.  It only adds
> complexity based on real requirements.
>

The global weights are intended to help administrators hide that
complexity from actual end-users.  The administrator of a system should
already be aware of the hardware configuration; on top of that, a system
service can auto-configure these weights at system bring-up and on
memory-device hotplug to simplify and hide the complexity even further.

Such a service can use the ACPI HMAT (Heterogeneous Memory Attribute
Table) to set the global weight information automatically at boot time
and/or on hotplug.  Extensions along these lines have already been
proposed in prior RFCs and on the cxl mailing list.

To break this down a little more explicitly into 6 example use cases,
let's consider the ways in which weighted interleave may be requested
via set_mempolicy() or set_mempolicy2().

1. Actual end-user software calls it directly (or through libnuma)
   a) it can call set_mempolicy() without task weights and accept the
      administrator-configured global weights
   b) it can call set_mempolicy2() with task weights and use task-local
      weighting

2. Actual end-user runs `numactl -w[weights] --interleave ...`
   a) if weights are not given, use the global weights
   b) if weights are given, use task-local weights

3. Administrator / orchestrator opts user software into weighted
   interleave by wrapping it with `numactl -w --interleave ...`
   a) if weights are not given, use the global weights
   b) if weights are given, use task-local weights

The most common use case is likely to be (3a) - an administrator opting
a user workload into weighted interleave via `numactl -w --interleave`,
or an orchestrator such as Kubernetes doing something similar on
pod/container dispatch.

In all cases where the user does not define weights, they are trusting
the weights set by the administrator (or a system daemon) to provide the
optimal distribution, removing the complexity of understanding the
hardware environment from the end-user.  In all cases where the user
does define weights, they are accepting that complexity themselves.
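To make (1a) concrete, here is a rough, untested sketch of what the
end-user side could look like.  The MPOL_WEIGHTED_INTERLEAVE constant is
an assumption here - use whatever value the patched uapi headers in this
series actually define:

  /*
   * Use-case (1a): opt into weighted interleave and accept the
   * administrator-configured global weights (no task-local weights
   * are passed).  Untested illustration only.
   */
  #include <numaif.h>   /* set_mempolicy(), link with -lnuma */
  #include <stdio.h>

  #ifndef MPOL_WEIGHTED_INTERLEAVE
  #define MPOL_WEIGHTED_INTERLEAVE 6   /* assumed value, illustration only */
  #endif

  int main(void)
  {
          unsigned long nodemask = 0xf;   /* e.g. nodes 0-3 */

          if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
                            8 * sizeof(nodemask)))
                  perror("set_mempolicy");

          /* ... allocate and touch memory as usual ... */
          return 0;
  }

Case (1b) is the same idea, except the task passes its own weights via
the new set_mempolicy2() interface instead of relying on the global ones.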
On the topic of the actual complexity of system hardware that is being
hidden, we must consider a non-homogeneous bandwidth environment.  The
simplest form is an off-the-shelf Intel 2-socket server with CXL memory
expanders.  Let's consider a 2-socket system with the following
configuration:

  DRAM on Socket0:  300GB/s local DRAM bandwidth  (node 0)
  DRAM on Socket1:  300GB/s local DRAM bandwidth  (node 1)
  CXL on Socket0:   128GB/s bandwidth             (node 2)
  CXL on Socket1:   128GB/s bandwidth             (node 3)

A single linear array of weights is not sufficient to capture the
bandwidth distribution of this system, because of the UPI link between
socket0 and socket1, which changes the distribution depending on where a
task runs.  For example, 3 UPI links provide 62.4GB/s full-duplex.

From the perspective of socket 0, the following is true:

  Bandwidth to Socket0 DRAM:  300GB/s    (node 0)
  Bandwidth to Socket0 CXL:   128GB/s    (node 2)
  Aggregate bandwidth to nodes (1,3):    62.4GB/s

From the perspective of socket 1, this changes to:

  Bandwidth to Socket1 DRAM:  300GB/s    (node 1)
  Bandwidth to Socket1 CXL:   128GB/s    (node 3)
  Aggregate bandwidth to nodes (0,2):    62.4GB/s

With a single linear array of weights that applies to the entire system,
you cannot represent this configuration; any single set of weights is
sub-optimal for at least one of the sockets (a rough sketch putting
numbers on this is appended at the end of this mail).  Here the goal of
simplicity defeats the entire point of weighted interleave in a
heterogeneous environment.

> For these complex requirements, we will have process_set_mempolicy2().
> I think that it's even more flexible than the global matrix.

process_set_mempolicy2() has a *very* long road to exist.  The problem
of mempolicy reference counting is non-trivial, and the plumbing
requires changes to no fewer than 4 subsystems.  Beyond that, the
complexity of actually using process_set_mempolicy2() is the same as for
any use of set_mempolicy2() with task-local weights: the absolute
highest.  The global weighting matrix actually hides this complexity
entirely.

> --
> Best Regards,
> Huang, Ying
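To put rough numbers on the socket0/socket1 example above, here is a
small, untested sketch that derives interleave weights from each
socket's view of bandwidth.  The derivation itself (split the 62.4GB/s
UPI aggregate evenly across the two remote nodes, divide by the smallest
value, round) is only an illustrative assumption, not a proposal for how
the kernel or a system daemon should compute weights:

  /*
   * Derive per-node interleave weights from each socket's view of
   * bandwidth (numbers taken from the 2-socket example above).
   * Untested illustration only.
   */
  #include <stdio.h>
  #include <math.h>     /* lround(), link with -lm */

  #define NR_NODES 4

  static void print_weights(const char *who, const double bw[NR_NODES])
  {
          double min = bw[0];
          int i;

          for (i = 1; i < NR_NODES; i++)
                  if (bw[i] < min)
                          min = bw[i];

          printf("%s weights:", who);
          for (i = 0; i < NR_NODES; i++)
                  printf(" node%d=%ld", i, lround(bw[i] / min));
          printf("\n");
  }

  int main(void)
  {
          /* nodes: 0=S0 DRAM, 1=S1 DRAM, 2=S0 CXL, 3=S1 CXL         */
          /* 62.4GB/s UPI aggregate assumed split evenly (31.2 each) */
          double socket0_view[NR_NODES] = { 300.0, 31.2, 128.0, 31.2 };
          double socket1_view[NR_NODES] = { 31.2, 300.0, 31.2, 128.0 };

          print_weights("socket0", socket0_view);
          print_weights("socket1", socket1_view);
          return 0;
  }

This prints roughly node0=10 node1=1 node2=4 node3=1 for socket 0 and
the mirror image for socket 1 - two different weight vectors that no
single flat array can satisfy, which is exactly the point above.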