On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@xxxxxxxxxxxxxxxxx> wrote:

> Joshua Hahn <joshua.hahnjy@xxxxxxxxx> writes:
>
> > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
> >
> >> Hello everyone, I hope everyone has had a great start to 2025!
> >>
> >> Recently, I have been working on a patch series [1] with
> >> Gregory Price <gourry@xxxxxxxxxx> that provides new default interleave
> >> weights, along with dynamic re-weighting on hotplug events and a series
> >> of UAPIs that allow users to configure how they want the defaults to behave.
> >>
> >> In introducing these new defaults, discussions have opened up in the
> >> community regarding how best to create a UAPI that can provide
> >> coherent and transparent interactions for the user. In particular, consider
> >> this scenario: when a hotplug event happens and a node comes online
> >> with new bandwidth information (and therefore changes the bandwidth
> >> distribution across the system), should user-set weights be overwritten
> >> to reflect the new distribution? If so, how can we justify overwriting
> >> user-set values in a sysfs interface? If not, how will users manually
> >> adjust the node weights to the optimal weight?
> >>
> >> I would like to revisit some of the design choices made for this patch,
> >> including how the defaults were derived, and open the conversation to
> >> hear what the community believes is a reasonable way to allow users to
> >> tune weighted interleave weights. More broadly, I hope to gather
> >> community insight on how they use weighted interleave, and do my best to
> >> reflect those workflows in the patch.
> >
> > Weighted interleave has since moved on to v7 [1], and a v8 is currently being
> > drafted. Through feedback from reviewers, we have landed on a coherent UAPI
> > that gives users two options: auto mode, which leaves all weight calculation
> > decisions to the system, and manual mode, which leaves weighted interleave
> > the same as it is without the patch.
> >
> > Given that the patch's functionality is mostly concrete and that the questions
> > I hoped to raise during this slot were answered via patch feedback, I hope to
> > ask another question during the talk:
> >
> > Should the system dynamically change what metrics it uses to weight the nodes,
> > based on what bottlenecks the system is currently facing?
> >
> > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic
> > to determine what a node's weight should be. However, what if the system is
> > not bottlenecked by bandwidth, but by latency? A system could also be
> > bottlenecked by read bandwidth, but not by write bandwidth.
> >
> > Consider a scenario where a system has many memory nodes with varying
> > latencies and bandwidths. When the system is not bottlenecked by bandwidth,
> > it might prefer to allocate memory from nodes with lower latency. Once the
> > system starts feeling pressured by bandwidth, the weights for high-bandwidth
> > (but also high-latency) nodes would slowly increase to alleviate pressure
> > on the system. Once the system is back in a manageable state, weights for
> > low-latency nodes would start increasing again. Users would not have to be
> > aware of any of this -- they would just see the system take control of the
> > weight changes as the system's needs continue to change.
>
> IIUC, this assumes the capacity of all kinds of memory is large enough.
> However, this may be not true in some cases. So, another possibility is
> that, for a system with DRAM and CXL memory nodes:
>
> - There is free space on the DRAM node and the bandwidth of the DRAM node
>   isn't saturated: memory is allocated on the DRAM node.
>
> - There is no free space on the DRAM node and the bandwidth of the DRAM node
>   isn't saturated: cold pages are migrated to CXL memory nodes, while hot
>   pages are migrated to DRAM memory nodes.
>
> - The bandwidth of the DRAM node is saturated: hot pages are migrated to CXL
>   memory nodes.
>
> In general, I think that the real situation is complex and this makes it
> hard to implement a good policy in the kernel. So, I suspect that it's
> better to start with experiments in user space.

Hi Ying, thank you so much for your feedback, as always!

Yes, I agree. I brought up this idea out of curiosity, since I thought there
might be room to experiment with different configurations for weighted
interleave auto-tuning. As you know, we use min(read_bw, write_bw), which I
think is a good heuristic that works for the intent of the weighted interleave
auto-tuning patch -- I mainly wanted to explore what a system that switches
heuristics based on its current state might look like. But I think you are
right that this would be difficult to implement in the kernel.
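Just to make the numbers concrete, here is a rough userspace sketch of the
min(read_bw, write_bw) weighting. The two nodes and their bandwidth numbers
are made up, and the GCD reduction is just one way to boil raw bandwidth down
to small interleave weights -- it is meant to show the idea, not to mirror the
patch's actual implementation:

/*
 * Toy example: derive weighted-interleave weights from per-node
 * bandwidth using min(read_bw, write_bw) as the metric.
 * Node count and bandwidth numbers are hypothetical.
 */
#include <stdio.h>

struct node_bw {
        unsigned long read_bw;          /* MB/s */
        unsigned long write_bw;         /* MB/s */
};

static unsigned long gcd(unsigned long a, unsigned long b)
{
        while (b) {
                unsigned long t = a % b;

                a = b;
                b = t;
        }
        return a;
}

int main(void)
{
        /* Hypothetical two-node system: node 0 is DRAM, node 1 is CXL. */
        struct node_bw nodes[] = {
                { .read_bw = 240000, .write_bw = 180000 },
                { .read_bw =  60000, .write_bw =  45000 },
        };
        unsigned long min_bw[2];
        unsigned long g = 0;
        int i;

        for (i = 0; i < 2; i++) {
                min_bw[i] = nodes[i].read_bw < nodes[i].write_bw ?
                            nodes[i].read_bw : nodes[i].write_bw;
                g = gcd(g, min_bw[i]);
        }

        /* Reduce 180000:45000 to 4:1. */
        for (i = 0; i < 2; i++)
                printf("node%d weight = %lu\n", i, min_bw[i] / g);

        return 0;
}

On a real system the per-node numbers would come from the bandwidth data the
kernel already has (HMAT / CXL CDAT), and a daemon experimenting in user
space, along the lines you suggest, could write the resulting weights to
/sys/kernel/mm/mempolicy/weighted_interleave/nodeN.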
Thanks again, Ying! Will you be attending LSFMMBPF this year? I would love to
say hello in person :-)

Have a great day!
Joshua

> > This proposal also has some concerns that need to be addressed:
> > - How reactive should the system be, and how aggressively should it tune the
> >   weights? We don't want the system to overreact to short spikes in pressure.
> > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those
> >   "misplaced" pages be migrated? (probably not)
> > - Does this need to be in the kernel? A userspace daemon that monitors kernel
> >   metrics has the ability to make the changes (via the nodeN interfaces).
> >
> > Thoughts & comments are appreciated! Thank you, and have a great day!
> > Joshua
> >
> > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@xxxxxxxxx/
> >
> > Sent using hkml (https://github.com/sjp38/hackermail)
>
> ---
> Best Regards,
> Huang, Ying

Sent using hkml (https://github.com/sjp38/hackermail)