Michal Hocko <mhocko@xxxxxxxx> writes:

> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>> > > This patchset implements weighted interleave and adds a new sysfs
>> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
>> > >
>> > > The il_weight of a node is used by mempolicy to implement weighted
>> > > interleave when `numactl --interleave=...` is invoked. By default
>> > > il_weight for a node is always 1, which preserves the default round
>> > > robin interleave behavior.
>> > >
>> > > Interleave weights may be set from 0-100, and denote the number of
>> > > pages that should be allocated from the node when interleaving
>> > > occurs.
>> > >
>> > > For example, if a node's interleave weight is set to 5, 5 pages
>> > > will be allocated from that node before the next node is scheduled
>> > > for allocations.
>> >
>> > I find this semantic rather weird TBH. First of all, why do you think
>> > it makes sense to have those weights global for all users? What if
>> > different applications have different views on how to spread their
>> > interleaved memory?
>> >
>> > I do get that you might have different tiers with largely different
>> > runtime characteristics, but why would you want to interleave them
>> > into a single mapping and have hard-to-predict runtime behavior?
>> >
>> > [...]
>> > > In this way it becomes possible to set an interleaving strategy
>> > > that fits the available bandwidth for the devices available on
>> > > the system. An example system:
>> > >
>> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 2 - CXL Memory, 64GB/s BW, on Node 0 root complex
>> > > Node 3 - CXL Memory, 64GB/s BW, on Node 1 root complex
>> > >
>> > > In this setup, the effective weights for nodes 0-3 for a task
>> > > running on Node 0 may be [60, 20, 10, 10].
>> > >
>> > > This spreads memory out across devices which all have different
>> > > latency and bandwidth attributes in a way that can maximize the
>> > > available resources.
>> >
>> > OK, so why is this any better than not using any memory policy and
>> > relying on demotion to push cold memory down the tier hierarchy?
>> >
>> > What is the actual real-life usecase, and what kind of benefits can
>> > you present?
>>
>> There are two things CXL gives you: additional capacity and additional
>> bus bandwidth.
>>
>> The promotion/demotion mechanism is good for the capacity usecase,
>> where you have a nice hot/cold gradient in the workingset and want
>> placement accordingly across faster and slower memory.
>>
>> The interleaving is useful when you have a flatter workingset
>> distribution and poorer access locality. In that case, the CPU caches
>> are less effective and the workload can be bus-bound. The workload
>> might fit entirely into DRAM, but concentrating it there is
>> suboptimal. Fanning it out in proportion to the relative performance
>> of each memory tier gives better results.
>>
>> We experimented with datacenter workloads on such machines last year
>> and found significant performance benefits:
>>
>> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@xxxxxxxxxxx/T/
>
> Thanks, this is a useful insight.
>
>> This hopefully also explains why it's a global setting. The usecase is
>> different from conventional NUMA interleaving, which is used as a
>> locality measure: spread shared data evenly between compute nodes.
>> This one isn't about locality - the CXL tier doesn't have local
>> compute. Instead, the optimal spread is based on hardware parameters,
>> which is a global property rather than a per-workload one.
>
> Well, I am not convinced about that, TBH. Sure, it is probably a good
> fit for this specific CXL usecase, but it just doesn't fit into many
> others I can think of - e.g. proportional use of those tiers based on
> the workload: you get what you pay for.

For "pay", per my understanding, we need some cgroup-based
per-memory-tier (or per-node) usage limit. The following patchset is
the first step toward that:

https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@xxxxxxxxxxxxxxx/

--
Best Regards,
Huang, Ying
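
P.S. For anyone skimming the thread, below is a minimal userspace sketch
of the weighted round-robin semantics described in Gregory's cover letter
above: each node serves "weight" consecutive page allocations before the
next node is used, fed here with the example weights [60, 20, 10, 10].
This is purely illustrative - the names (struct wrr, wrr_next_node) are
made up and this is not the actual mempolicy implementation.

/*
 * Illustrative sketch of weighted round-robin interleave: hand out
 * "weight" consecutive allocations per node before moving on.
 * Assumes at least one weight is non-zero.
 */
#include <stdio.h>

struct wrr {
	const unsigned int *weights;	/* per-node il_weight values */
	unsigned int nr_nodes;
	unsigned int cur_node;		/* node currently in use */
	unsigned int used;		/* pages handed out from cur_node */
};

static unsigned int wrr_next_node(struct wrr *w)
{
	/* Advance past exhausted nodes and nodes with weight 0 */
	while (w->used >= w->weights[w->cur_node]) {
		w->cur_node = (w->cur_node + 1) % w->nr_nodes;
		w->used = 0;
	}
	w->used++;
	return w->cur_node;
}

int main(void)
{
	/* Example effective weights from the cover letter: nodes 0-3 */
	const unsigned int weights[] = { 60, 20, 10, 10 };
	struct wrr w = { .weights = weights, .nr_nodes = 4 };
	unsigned int pages[4] = { 0 };

	for (int i = 0; i < 1000; i++)
		pages[wrr_next_node(&w)]++;

	/* Prints 600/200/100/100: allocations in proportion to weight */
	for (unsigned int n = 0; n < 4; n++)
		printf("node %u: %u pages\n", n, pages[n]);
	return 0;
}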