Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

On Wed, Dec 06, 2023 at 08:50:23AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@xxxxxxxxxxxx> writes:
> >
> > From a complexity standpoint, it is exactly as complex as the hardware
> > configuration itself:  each socket has a different view of the memory
> > topology. If you have a non-homogeneous memory configuration (e.g. a 
> > different number of CXL expanders on one socket than the other), a flat
> > array of weights has no way of capturing this hardware configuration.
> 
> One important task of the software is to hide the complexity of hardware
> from the users.  At least it should provide the option.  It only adds
> complexity based on real requirements.
> 

The global weights are intended to help administrators hide that
complexity from actual end-users.

The administrator of a system should already be aware of the hardware
configuration; even so, a system service can be made which
auto-configures these weights at system bring-up and on memory-device
hotplug, simplifying things and hiding the complexity even further.

Such a service can use ACPI HMAT (Heterogeneous Memory Attribute
Table) information to set the global weights automatically at boot
time and/or on hotplug. Extensions along these lines have already been
proposed in prior RFCs and on the CXL mailing list.
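
As a rough illustration only (not part of this series): the sketch
below reads per-node bandwidth from the HMAT-backed sysfs attribute
/sys/devices/system/node/nodeN/access0/initiators/read_bandwidth
(only populated when firmware provides HMAT data) and writes
bandwidth-proportional weights to a *hypothetical* path,
/sys/kernel/mm/mempolicy/weighted_interleave/nodeN, standing in for
whatever interface this series ultimately exposes.

/* weightd.c - toy boot-time weight service; the target path and the
 * weighting heuristic are illustrative, not a proposed uAPI.
 */
#include <stdio.h>

#define MAX_NODES 4     /* nodes 0-3 in the example system below */

static long read_bw(int node)
{
        char path[128];
        long bw = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d"
                 "/access0/initiators/read_bandwidth", node);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%ld", &bw) != 1)
                bw = -1;
        fclose(f);
        return bw;
}

static void write_weight(int node, long weight)
{
        char path[128];
        FILE *f;

        /* hypothetical global-weight interface */
        snprintf(path, sizeof(path),
                 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d",
                 node);
        f = fopen(path, "w");
        if (!f)
                return;
        fprintf(f, "%ld\n", weight);
        fclose(f);
}

int main(void)
{
        long bw[MAX_NODES], min_bw = -1;
        int node;

        for (node = 0; node < MAX_NODES; node++) {
                bw[node] = read_bw(node);
                if (bw[node] > 0 && (min_bw < 0 || bw[node] < min_bw))
                        min_bw = bw[node];
        }

        /* crude heuristic: weight each node proportional to its
         * bandwidth, normalized so the slowest node gets weight 1 */
        for (node = 0; node < MAX_NODES; node++)
                if (bw[node] > 0)
                        write_weight(node, bw[node] / min_bw);

        return 0;
}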



To break this down a little more explicitly into 6 example use-cases,
let's consider the potential ways in which weighted interleave may be
set via set_mempolicy() or set_mempolicy2() (case 1a is sketched in
code further below).

1. Actual end-user software calls it directly (or through libnuma)
   a) they can call set_mempolicy() without task-weights and accept the
      administrator configured global weights
   b) they can call set_mempolicy2() with task-weights and use task-local
      defined weighting
2. Actual end-user uses `numactl -w[weights] --interleave ...`
   a) if weights are not defined, use global weights
   b) if weights are defined, use task-local weights
3. Administrator / Orchestrator opts user-software into weighted
   interleave by wrapping their software into `numactl -w --interleave`
   a) if weights are not defined, use global weights
   b) if weights are defined, use task-local weights

The most common use case is likely to be (3a) - an administrator opting
a user-workload into weighted-interleave via `numactl -w --interleave`
or an orchestrator such as Kubernetes doing something similar at
pod/container dispatch time.
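
For concreteness, case (1a) reduces to something like the sketch
below.  set_mempolicy() is the existing syscall (declared in
<numaif.h>, link with -lnuma); MPOL_WEIGHTED_INTERLEAVE is the mode
added by this series, so the fallback value here is illustrative.
Case (1b) would instead go through the proposed set_mempolicy2() with
an explicit task-local weight array.

#include <numaif.h>
#include <stdio.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6      /* illustrative value */
#endif

int main(void)
{
        unsigned long nodemask = 0xf;   /* interleave over nodes 0-3 */

        /* case (1a): no task-local weights supplied, so allocations
         * are spread using the administrator-configured global
         * weights */
        if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask)))
                perror("set_mempolicy");

        return 0;
}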



In all cases where the user does not define weights, they are trusting
the weights set by the administrator (or a system daemon) to provide an
optimal distribution, removing the complexity of understanding the
hardware environment from the end-user.



In all cases where the user does define weights, they are accepting the
complexity of understanding the hardware environment.



On the topic of the ACTUAL complexity of system hardware that is being
hidden, we must consider a non-homogeneous bandwidth environment. The
simplest form is an off-the-shelf Intel 2-socket server with CXL memory
expanders.

Let's consider a 2-socket system with the following configuration::

DRAM on Socket0:  300GB/s local DRAM bandwidth  (node 0)
DRAM on Socket1:  300GB/s local DRAM bandwidth  (node 1)
CXL on Socket0:   128GB/s bandwidth             (node 2)
CXL on Socket1:   128GB/s bandwidth             (node 3)

A single linear array of weights is not sufficient to capture the
complexities of bandwidth distributions on this system, because
of the presence of a UPI link between socket0 and socket1, which
changes the bandwidth distribution depending on where a task runs.

For example, 3 UPI links provide 62.4GB/s of full-duplex bandwidth.

From the perspective of socket 0, the following is true:

Bandwidth to Socket0 DRAM:  300GB/s    (node 0)
Bandwidth to Socket0 CXL:   100GB/s    (node 2)
Aggregate bandwidth to nodes (1,3):  62.4GB/s

From the perspective of socket 1, this changes to:

Bandwidth to Socket1 DRAM:  300GB/s    (node 1)
Bandwidth to Socket1 CXL:   100GB/s    (node 3)
Aggregate bandwidth to nodes (0,2):  62.4GB/s
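
To make that concrete (back-of-the-envelope arithmetic, assuming the
62.4GB/s of cross-socket bandwidth is split evenly between the two
remote nodes), the bandwidth-proportional weights each socket's tasks
would want are roughly:

Tasks on socket 0:  node0=10  node1=1   node2=3  node3=1
Tasks on socket 1:  node0=1   node1=10  node2=1  node3=3

The same ratios, applied to different nodes.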

With a single linear array of weights that applies to the entire
system, you cannot represent this configuration.  In fact, any single
set of weights will always produce a sub-optimal distribution for tasks
on at least one of the sockets.

Prioritizing simplicity here defeats the entire purpose of weighted
interleave in a heterogeneous environment.

> 
> For these complex requirements, we will have process_set_mempolicy2().
> I think that it's even more flexible than the global matrix.
>

process_set_mempolicy2() has a *very* long road before it can exist.
The problem of mempolicy reference counting is non-trivial, and the
plumbing requires changes to no fewer than 4 subsystems.

Beyond that, the complexity of actually using process_set_mempolicy2()
is the same as any situation in which set_mempolicy2() is called with
task-local weights: the absolute highest.

The global weighting matrix actually hides this complexity entirely.

> --
> Best Regards,
> Huang, Ying



