Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> This patchset implements weighted interleave and adds a new cgroup
> sysfs entry: cgroup/memory.interleave_weights (excluded from root).
>
> The il_weight of a node is used by mempolicy to implement weighted
> interleave when `numactl --interleave=...` is invoked.  By default
> il_weight for a node is always 1, which preserves the default round
> robin interleave behavior.

IIUC, this makes it almost impossible to set the default weight of a
node from the node memory bandwidth information.  This will make the
life of users a little harder.

If so, how about using a new memory policy mode, for example
MPOL_WEIGHTED_INTERLEAVE, etc.?

> Interleave weights denote the number of pages that should be
> allocated from the node when interleaving occurs and have a range
> of 1-255.  The weight of a node can never be 0, and instead the
> preferred way to prevent allocation is to remove the node from the
> cpuset or mempolicy altogether.
>
> For example, if a node's interleave weight is set to 5, 5 pages
> will be allocated from that node before the next node is scheduled
> for allocations.
>
>   # Set node weight for node 0 to 5
>   echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights
>
>   # Set node weight for node 1 to 3
>   echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights
>
>   # View the currently set weights
>   cat /sys/fs/cgroup/user.slice/memory.interleave_weights
>   0:5,1:3
>
> Weights will only be displayed for possible nodes.
>
> With this it becomes possible to set an interleaving strategy
> that fits the available bandwidth for the devices available on
> the system.  An example system:
>
>   Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
>   Node 1 - CXL Memory. 64GB/s BW, on Node 0 root complex
>
> In this setup, the effective weights for a node set of [0,1]
> may be [86, 14] (86% of memory on Node 0, 14% on node 1)
> or some smaller fraction thereof to encourage quicker rounds
> for better overall distribution.
>
> This spreads memory out across devices which all have different
> latency and bandwidth attributes in a way that can maximize the
> available resources.
>

--
Best Regards,
Huang, Ying
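
P.S. To make the arithmetic in the quoted example concrete, here is a rough
userspace sketch in Python.  It is not the kernel implementation.  The
function names weights_from_bandwidth() and weighted_interleave() are made
up for illustration, and scaling the bandwidth shares so they sum to roughly
100 is only one assumed way to arrive at [86, 14].

```python
from itertools import islice


def weights_from_bandwidth(bandwidth_gbps):
    """Map {node: bandwidth} to {node: weight}, clamped to the 1-255 range.

    Assumption: weights are the rounded percentage share of total bandwidth.
    """
    total = sum(bandwidth_gbps.values())
    return {
        node: max(1, min(255, round(100 * bw / total)))
        for node, bw in bandwidth_gbps.items()
    }


def weighted_interleave(weights):
    """Yield the node for each successive page allocation, forever.

    Each node receives `weight` consecutive pages before the next node
    is scheduled, as described in the quoted cover letter.
    """
    while True:
        for node in sorted(weights):
            for _ in range(weights[node]):
                yield node


# Bandwidth figures from the example system (GB/s).
weights = weights_from_bandwidth({0: 400, 1: 64})
print(weights)

# First round with the weights from the echo example (0:5, 1:3).
print(list(islice(weighted_interleave({0: 5, 1: 3}), 8)))
```

With every weight left at 1, the generator degenerates to plain round robin,
which is the default behavior the cover letter says is preserved.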