Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave

Ravi Jonnalagadda <ravis.opensrc@xxxxxxxxxx> · Wed, 1 Nov 2023 14:59:23 +0530

>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>
>[snip]
>
>>>
>>> > This hopefully also explains why it's a global setting. The usecase is
>>> > different from conventional NUMA interleaving, which is used as a
>>> > locality measure: spread shared data evenly between compute
>>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>>> > compute. Instead, the optimal spread is based on hardware parameters,
>>> > which is a global property rather than a per-workload one.
>>>
>>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>>> for this specific CXL usecase but it just doesn't fit into many others I
>>> can think of - e.g. proportional use of those tiers based on the
>>> workload - you get what you pay for.
>>>
>>> Is there any specific reason for not having a new interleave interface
>>> which defines weights for the nodemask? Is this because the policy
>>> itself is very dynamic or is this more driven by simplicity of use?
>>
>> A downside of *requiring* weights to be paired with the mempolicy is
>> that it's then the application that would have to figure out the
>> weights dynamically, instead of having a static host configuration. A
>> policy of "I want to be spread for optimal bus bandwidth" translates
>> between different hardware configurations, but optimal weights will
>> vary depending on the type of machine a job runs on.
>>
>> That doesn't mean there couldn't be usecases for having weights as
>> policy as well in other scenarios, like you allude to above. It's just
>> so far such usecases haven't really materialized or spelled out
>> concretely. Maybe we just want both - a global default, and the
>> ability to override it locally.
>
>I think that this is a good idea.  The system-wise configuration with
>reasonable default makes applications life much easier.  If more control
>is needed, some kind of workload specific configuration can be added.

Glad that we are in agreement here. For the bandwidth expansion use cases
that this interleave patchset caters to, most applications would simply
follow the "reasonable defaults" for weights.
Applications would probably need to choose different interleave weights
mainly for capacity expansion, which the default memory tiering
implementation already supports, and with better latency.

>And, instead of adding another memory policy, a cgroup-wise
>configuration may be easier to be used.  The per-workload weight may
>need to be adjusted when we deploying different combination of workloads
>in the system.
>
>Another question is that should the weight be per-memory-tier or
>per-node?  In this patchset, the weight is per-source-target-node
>combination.  That is, the weight becomes a matrix instead of a vector.
>IIUC, this is used to control cross-socket memory access in addition to
>per-memory-type memory access.  Do you think the added complexity is
>necessary?

Pros and cons of node-based interleave:
Pros:
1. Weights can be defined individually for devices with different bandwidth
and latency characteristics, irrespective of which tier they fall into.
2. Defining the weight per source-target node pair is necessary for
multi-socket systems, where some devices may be closer to one socket than
to the other.
Cons:
1. Weights need to be programmed for all the nodes, which can be tedious on
systems with a lot of NUMA nodes.
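
To make the per-source-target weighting concrete, here is a rough sketch in
Python (not kernel code). The three-node topology, the weight values, and the
names WEIGHTS, interleave and page_counts are all made up for illustration,
and the "allocate weight pages on a node, then move on" loop is only one
possible reading of weighted interleave:

```python
from itertools import islice

# Hypothetical weight matrix: WEIGHTS[source_node][target_node].
# Nodes 0 and 1 are DRAM on sockets 0 and 1; node 2 is a CXL device
# assumed to sit closer to socket 0. All values are illustrative.
WEIGHTS = {
    0: {0: 4, 1: 2, 2: 2},
    1: {0: 2, 1: 4, 2: 1},
}

def interleave(source, nodemask, weights=WEIGHTS):
    """Yield the target node for each successive page allocated from `source`.

    Each node in the nodemask receives `weight` consecutive pages per round.
    A node missing from the row falls back to weight 1 (an assumption).
    """
    row = weights[source]
    while True:
        for node in sorted(nodemask):
            for _ in range(row.get(node, 1)):
                yield node

def page_counts(source, nodemask, npages):
    """Count how many of `npages` allocations land on each node."""
    counts = {node: 0 for node in nodemask}
    for node in islice(interleave(source, nodemask), npages):
        counts[node] += 1
    return counts
```

With these made-up weights, a task on socket 0 and a task on socket 1 using
the same nodemask end up with different spreads, which is the point of having
a matrix rather than a vector.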

Pros and cons of memory-tier-based interleave:
Pros:
1. A weight programmed per initiator applies to all the nodes in the tier.
2. Weights can be calculated from the cumulative bandwidth of all the nodes
in the tier, and need to be programmed only once for all the nodes in a
given tier.
3. It may be useful when the number of NUMA nodes with similar latency and
bandwidth characteristics grows, possibly with pooling use cases.
Cons:
1. If nodes with different bandwidth and latency characteristics are placed
in the same tier, as seen in the current mainline kernel, it will be difficult
to apply a correct interleave weight policy.
2. Functionality will be needed to move nodes between tiers, or to create new
tiers for such nodes, so that correct interleave weights can be programmed.
We are currently working on a patch to support this.
3. On systems where each NUMA node has different characteristics, each node
might end up in its own memory tier, which would be equivalent to node-based
interleaving. This scenario might apply on newer systems, where all CXL memory
from different devices under a port is combined to form a single NUMA node.
4. Users may need to keep track of the different memory tiers, and of which
nodes are present in each tier, in order to invoke the interleave policy.
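
As a rough illustration of tier pro 2 and tier con 1 above, the sketch below
(Python, not kernel code) derives weights from bandwidth in two ways: once per
tier from the cumulative bandwidth, and once per node. The tier names, node
numbers, bandwidth figures, and the functions tier_weights and node_weights
are assumptions made up for this example:

```python
from functools import reduce
from math import gcd

# Hypothetical topology: tier -> {node: bandwidth in GB/s}.
# The numbers are illustrative, not measurements.
TIERS = {
    "dram": {0: 100, 1: 100},
    "cxl": {2: 30, 3: 20},
}

def tier_weights(tiers):
    """One weight per tier, proportional to the tier's cumulative bandwidth."""
    totals = {tier: sum(nodes.values()) for tier, nodes in tiers.items()}
    divisor = reduce(gcd, totals.values())
    return {tier: total // divisor for tier, total in totals.items()}

def node_weights(tiers):
    """One weight per node, proportional to that node's own bandwidth."""
    bandwidth = {node: bw for nodes in tiers.values() for node, bw in nodes.items()}
    divisor = reduce(gcd, bandwidth.values())
    return {node: bw // divisor for node, bw in bandwidth.items()}
```

Here the tier view needs only two values (dram 4, cxl 1), but it cannot
express that node 2 is faster than node 3. The node view captures that
difference at the cost of one weight per node.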

>
>> Could you elaborate on the 'get what you pay for' usecase you
>> mentioned?
>
>--
>Best Regards,
>Huang, Ying
--
Best Regards,
Ravi Jonnalagadda