On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > This patchset implements weighted interleave and adds a new sysfs
> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > >
> > > The il_weight of a node is used by mempolicy to implement weighted
> > > interleave when `numactl --interleave=...` is invoked. By default
> > > il_weight for a node is always 1, which preserves the default
> > > round-robin interleave behavior.
> > >
> > > Interleave weights may be set from 0-100, and denote the number of
> > > pages that should be allocated from the node when interleaving
> > > occurs.
> > >
> > > For example, if a node's interleave weight is set to 5, 5 pages
> > > will be allocated from that node before the next node is scheduled
> > > for allocations.
> >
> > I find this semantic rather weird TBH. First of all, why do you think
> > it makes sense to have those weights global for all users? What if
> > different applications have a different view on how to spread their
> > interleaved memory?
> >
> > I do get that you might have different tiers with largely different
> > runtime characteristics, but why would you want to interleave them
> > into a single mapping and have hard-to-predict runtime behavior?
> >
> > [...]
> > > In this way it becomes possible to set an interleaving strategy
> > > that fits the available bandwidth for the devices available on
> > > the system. An example system:
> > >
> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > >
> > > In this setup, the effective weights for nodes 0-3 for a task
> > > running on Node 0 may be [60, 20, 10, 10].
> > >
> > > This spreads memory out across devices which all have different
> > > latency and bandwidth attributes in a way that can maximize the
> > > available resources.
> >
> > OK, so why is this any better than not using any memory policy and
> > relying on demotion to push out cold memory down the tier hierarchy?
> >
> > What is the actual real life usecase and what kind of benefits can
> > you present?
>
> There are two things CXL gives you: additional capacity and additional
> bus bandwidth.
>
> The promotion/demotion mechanism is good for the capacity usecase,
> where you have a nice hot/cold gradient in the workingset and want
> placement accordingly across faster and slower memory.
>
> The interleaving is useful when you have a flatter workingset
> distribution and poorer access locality. In that case, the CPU caches
> are less effective and the workload can be bus-bound. The workload
> might fit entirely into DRAM, but concentrating it there is
> suboptimal. Fanning it out in proportion to the relative performance
> of each memory tier gives better results.
>
> We experimented with datacenter workloads on such machines last year
> and found significant performance benefits:
>
> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/

Thanks, this is a useful insight.
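The semantics in the quoted cover letter amount to a weighted round-robin
over the nodemask: each node serves il_weight consecutive page
allocations before the pointer advances to the next node. A minimal
user-space sketch of that behavior (a hypothetical illustration, not the
mempolicy code from the series), using the example [60, 20, 10, 10]
weights:

/*
 * Weighted round-robin sketch: a node with weight W serves W
 * consecutive "allocations" before the iterator moves on.
 * Hypothetical code for illustration only.
 */
#include <stdio.h>

#define NR_NODES 4

/* il_weight per node, e.g. the bandwidth-derived example above */
static const int il_weight[NR_NODES] = { 60, 20, 10, 10 };

struct wrr_state {
	int cur;	/* node currently being drained */
	int left;	/* allocations left on that node */
};

static int next_node(struct wrr_state *st)
{
	while (st->left == 0) {
		st->cur = (st->cur + 1) % NR_NODES;
		st->left = il_weight[st->cur];	/* weight 0 skips the node */
	}
	st->left--;
	return st->cur;
}

int main(void)
{
	struct wrr_state st = { .cur = NR_NODES - 1, .left = 0 };

	/* the first 100 allocations split 60/20/10/10 across nodes 0-3 */
	for (int i = 0; i < 100; i++)
		printf("page %3d -> node %d\n", i, next_node(&st));
	return 0;
}

With these weights the allocations land 60/20/10/10 across nodes 0-3,
and a weight of 0 simply skips a node.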
> This hopefully also explains why it's a global setting. The usecase is
> different from conventional NUMA interleaving, which is used as a
> locality measure: spread shared data evenly between compute nodes.
> This one isn't about locality - the CXL tier doesn't have local
> compute. Instead, the optimal spread is based on hardware parameters,
> which is a global property rather than a per-workload one.

Well, I am not convinced about that TBH. Sure it is probably a good fit
for this specific CXL usecase but it just doesn't fit into many others
I can think of - e.g. proportional use of those tiers based on the
workload - you get what you pay for.

Is there any specific reason for not having a new interleave interface
which defines weights for the nodemask? Is this because the policy
itself is very dynamic or is this more driven by simplicity of use?
--
Michal Hocko
SUSE Labs
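For comparison, this is roughly how the interface under discussion would
be exercised: the weights are set once, system-wide, through the
proposed sysfs files, and the per-task side is plain MPOL_INTERLEAVE. A
rough usage sketch, assuming the series' il_weight paths and an
"access0" initiator class (both exist only with the patchset applied);
build with -lnuma:

/*
 * Usage sketch for the proposed global-weight interface.  The sysfs
 * paths and the access0 class are assumptions based on the cover
 * letter; this is not code from the series.
 */
#include <stdio.h>
#include <numaif.h>		/* set_mempolicy(), MPOL_INTERLEAVE */

static int set_il_weight(int node, int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/access0/il_weight", node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", weight);
	return fclose(f);
}

int main(void)
{
	/* the bandwidth-derived example weights from the cover letter */
	static const int weights[] = { 60, 20, 10, 10 };
	unsigned long nodemask = 0;

	for (int node = 0; node < 4; node++) {
		if (set_il_weight(node, weights[node]))
			perror("il_weight");	/* needs root + patched kernel */
		nodemask |= 1UL << node;
	}

	/*
	 * Every task interleaving over nodes 0-3 now shares the same
	 * global weights - the point being questioned above.
	 */
	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, 8 * sizeof(nodemask)))
		perror("set_mempolicy");

	/* ... allocate and touch memory here ... */
	return 0;
}

The alternative raised in the question would instead attach the per-node
weights to the policy itself, e.g. alongside the nodemask handed to the
kernel, so that different tasks could interleave with different ratios.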