On Fri, 14 Mar 2025 18:08:35 +0800 "Huang, Ying" <ying.huang@xxxxxxxxxxxxxxxxx> wrote:

> Joshua Hahn <joshua.hahnjy@xxxxxxxxx> writes:
>
> > On Thu, 9 Jan 2025 13:50:48 -0500 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
> >
> >> Hello everyone, I hope everyone has had a great start to 2025!
> >>
> >> Recently, I have been working on a patch series [1] with
> >> Gregory Price <gourry@xxxxxxxxxx> that provides new default interleave
> >> weights, along with dynamic re-weighting on hotplug events and a series
> >> of UAPIs that allow users to configure how they want the defaults to behave.
> >>
> >> In introducing these new defaults, discussions have opened up in the
> >> community regarding how best to create a UAPI that can provide
> >> coherent and transparent interactions for the user. In particular, consider
> >> this scenario: when a hotplug event happens and a node comes online
> >> with new bandwidth information (and therefore changes the bandwidth
> >> distribution across the system), should user-set weights be overwritten
> >> to reflect the new distribution? If so, how can we justify overwriting
> >> user-set values in a sysfs interface? If not, how will users manually
> >> adjust the node weights to the optimal weight?
> >>
> >> I would like to revisit some of the design choices made for this patch,
> >> including how the defaults were derived, and open the conversation to
> >> hear what the community believes is a reasonable way to allow users to
> >> tune weighted interleave weights. More broadly, I hope to gather
> >> community insight on how they use weighted interleave, and do my best to
> >> reflect those workflows in the patch.
> >
> > Weighted interleave has since moved on to v7 [1], and a v8 is currently being
> > drafted. Through feedback from reviewers, we have landed on a coherent UAPI
> > that gives users two options: auto mode, which leaves all weight calculation
> > decisions to the system, and manual mode, which leaves weighted interleave
> > the same as it is without the patch.
> >
> > Given that the patch's functionality is mostly concrete and that the questions
> > I hoped to raise during this slot were answered via patch feedback, I hope to
> > ask another question during the talk:
> >
> > Should the system dynamically change what metrics it uses to weight the nodes,
> > based on what bottlenecks the system is currently facing?
> >
> > In the patch, min(read_bandwidth, write_bandwidth) is used as the heuristic
> > to determine what a node's weight should be. However, what if the system is
> > not bottlenecked by bandwidth, but by latency? A system could also be
> > bottlenecked by read bandwidth, but not by write bandwidth.
> >
> > Consider a scenario where a system has many memory nodes with varying
> > latencies and bandwidths. When the system is not bottlenecked by bandwidth,
> > it might prefer to allocate memory from nodes with lower latency. Once the
> > system starts feeling pressured by bandwidth, the weights for high-bandwidth
> > (but also high-latency) nodes would slowly increase to alleviate pressure
> > on the system. Once the system is back in a manageable state, weights for
> > low-latency nodes would start increasing again. Users would not have to be
> > aware of any of this -- they would just see the system take control of the
> > weight changes as the system's needs continue to change.
>
> IIUC, this assumes the capacity of all kinds of memory is large enough.
> However, this may be not true in some cases. So, another possibility is
> that, for a system with DRAM and CXL memory nodes:
>
> - There is free space on the DRAM node and the bandwidth of the DRAM node
>   isn't saturated: memory is allocated on the DRAM node.
>
> - There is no free space on the DRAM node and the bandwidth of the DRAM node
>   isn't saturated: cold pages are migrated to CXL memory nodes, while hot
>   pages are migrated to DRAM memory nodes.
>
> - The bandwidth of the DRAM node is saturated: hot pages are migrated to CXL
>   memory nodes.
>
> In general, I think that the real situation is complex and this makes it
> hard to implement a good policy in the kernel. So, I suspect that it's
> better to start with experiments in user space.

Hi Ying, thank you so much for your feedback, as always!

Yes, I agree. I brought up this idea out of curiosity, since I thought there
might be room to experiment with different configurations for weighted
interleave auto-tuning. As you know, we use min(read_bw, write_bw), which I
think is a good heuristic that works for the intent of the weighted interleave
auto-tuning patch -- I mainly wanted to explore what a system that switches
heuristics based on its current state might look like. But I think you are
right that this would be difficult to implement in the kernel.
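Just to make the numbers concrete, here is a rough userspace sketch of the
min(read_bw, write_bw) weighting. The two nodes and their bandwidth numbers
are made up, and the GCD reduction is just one way to boil raw bandwidth down
to small interleave weights -- it is meant to show the idea, not to mirror the
patch's actual implementation:

/*
 * Toy example: derive weighted-interleave weights from per-node
 * bandwidth using min(read_bw, write_bw) as the metric.
 * Node count and bandwidth numbers are hypothetical.
 */
#include <stdio.h>

struct node_bw {
        unsigned long read_bw;          /* MB/s */
        unsigned long write_bw;         /* MB/s */
};

static unsigned long gcd(unsigned long a, unsigned long b)
{
        while (b) {
                unsigned long t = a % b;

                a = b;
                b = t;
        }
        return a;
}

int main(void)
{
        /* Hypothetical two-node system: node 0 is DRAM, node 1 is CXL. */
        struct node_bw nodes[] = {
                { .read_bw = 240000, .write_bw = 180000 },
                { .read_bw =  60000, .write_bw =  45000 },
        };
        unsigned long min_bw[2];
        unsigned long g = 0;
        int i;

        for (i = 0; i < 2; i++) {
                min_bw[i] = nodes[i].read_bw < nodes[i].write_bw ?
                            nodes[i].read_bw : nodes[i].write_bw;
                g = gcd(g, min_bw[i]);
        }

        /* Reduce 180000:45000 to 4:1. */
        for (i = 0; i < 2; i++)
                printf("node%d weight = %lu\n", i, min_bw[i] / g);

        return 0;
}

On a real system the per-node numbers would come from the bandwidth data the
kernel already has (HMAT / CXL CDAT), and a daemon experimenting in user
space, along the lines you suggest, could write the resulting weights to
/sys/kernel/mm/mempolicy/weighted_interleave/nodeN.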
Thanks again, Ying! Will you be attending LSFMMBPF this year? I would love to
say hello in person :-)

Have a great day!
Joshua

> > This proposal also has some concerns that need to be addressed:
> > - How reactive should the system be, and how aggressively should it tune the
> >   weights? We don't want the system to overreact to short spikes in pressure.
> > - Does dynamic weight adjusting lead to pages being "misplaced"? Should those
> >   "misplaced" pages be migrated? (probably not)
> > - Does this need to be in the kernel? A userspace daemon that monitors kernel
> >   metrics has the ability to make the changes (via the nodeN interfaces).
> >
> > Thoughts & comments are appreciated! Thank you, and have a great day!
> > Joshua
> >
> > [1] https://lore.kernel.org/all/20250305200506.2529583-1-joshua.hahnjy@xxxxxxxxx/
> >
> > Sent using hkml (https://github.com/sjp38/hackermail)
>
> ---
> Best Regards,
> Huang, Ying

Sent using hkml (https://github.com/sjp38/hackermail)