On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@xxxxxxxxxx wrote:
> Hello, Gregory.
>
> On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote:
> > Unfortunately mpol has no way of being changed from outside the task
> > itself once it's applied, other than changing its nodemasks via cpusets.
>
> Maybe it's time to add one?
>

I've been considering this as well, but there's more context being lost
here.  It's not just about being able to toggle the policy of a single
task, or a set of related tasks, but about supporting a more global data
interleaving strategy that makes use of bandwidth more effectively as
memory expansion and bandwidth expansion begin to occur on the PCIe/CXL
bus.

If the memory landscape of a system changes, for example due to a hotplug
event, you actually want to change the behavior of *every* task that is
using interleaving.  The fundamental bandwidth distribution of the entire
system has changed, so the behavior of every task using that memory should
change with it.

We've explored adding weights to mempolicy, memory tiers, nodes, memcg,
and now additionally cpusets.  In the last email I'd asked whether it
might actually be worth adding a new mempolicy component to cgroups to
aggregate these issues, rather than jamming them into either existing
component.  I would love your thoughts on that.

> > So one concrete use case: kubernetes might like to change cpusets or
> > move tasks from one cgroup to another, or a VM might be migrated from
> > one set of nodes to another (technically not mutually exclusive here).
> > Some memory policy settings (like weights) may no longer apply when
> > this happens, so it would be preferable to have a way to change them.
>
> Neither covers all use cases. As you noted in your mempolicy message, if
> the application wants finer grained control, the cgroup interface isn't
> great. In general, any changes which are dynamically initiated by the
> application itself aren't a great fit for cgroup.
>

It is certainly simple enough to add weights to mempolicy, but there are
limitations.  Mempolicy is extremely `current task` focused, and
significant refactor work would be needed to allow external tasks to
toggle a target task's mempolicy.  In particular, I worry about potential
concurrency issues, since mempolicy can be in the hot allocation path.
(Additionally, as you note below, you would have to hit every child thread
separately to make effective changes, since it is per-task.)

I'm not opposed to this, but it was suggested to me that maybe there is a
better place to put these weights.  Maybe the weights can be managed
mostly through RCU (rough sketch below), so perhaps the concern is
overblown.  Anyway...

It's certainly my intent to add weights to mempolicy, as that's where I
started.  If that is the preferred starting point from the perspective of
the mm community, I will revert to proposing set_mempolicy2 and/or fully
converting mempolicy into a sys/procfs-friendly mechanism.

The goal here is to enable mempolicy, or something like it, to acquire
additional flexibility in a heterogeneous memory world, considering how
threads may be migrated or checkpointed/restored, and how use cases like
bandwidth expansion may be insufficiently serviced by something as
fine-grained as per-task mempolicies.
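To make the RCU thought above a bit more concrete, here's a rough,
hypothetical sketch of a read-mostly global weight table (all structure,
variable, and function names here are made up purely for illustration;
this is not proposed code, just the pattern I have in mind):

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Hypothetical interleave weight table, replaced wholesale on update. */
struct il_weights {
	struct rcu_head rcu;
	unsigned char weight[MAX_NUMNODES];
};

static struct il_weights __rcu *global_il_weights;
static DEFINE_MUTEX(il_weights_lock);

/* Hot allocation path: read side is lockless under rcu_read_lock(). */
static unsigned char il_weight_of(int nid)
{
	struct il_weights *w;
	unsigned char val = 1;	/* default: plain round-robin weight */

	rcu_read_lock();
	w = rcu_dereference(global_il_weights);
	if (w)
		val = w->weight[nid];
	rcu_read_unlock();

	return val;
}

/* Slow path (sysfs/cgroup/syscall writer): publish a new table. */
static int il_weights_update(const unsigned char *new_weights)
{
	struct il_weights *new, *old;

	new = kmalloc(sizeof(*new), GFP_KERNEL);
	if (!new)
		return -ENOMEM;
	memcpy(new->weight, new_weights, sizeof(new->weight));

	mutex_lock(&il_weights_lock);
	old = rcu_dereference_protected(global_il_weights,
					lockdep_is_held(&il_weights_lock));
	rcu_assign_pointer(global_il_weights, new);
	mutex_unlock(&il_weights_lock);

	if (old)
		kfree_rcu(old, rcu);
	return 0;
}

The point being that readers in the allocation path never take a lock or
block, and the writer's cost is one pointer publish plus a deferred free,
which is why I suspect the hot-path concern is manageable regardless of
whether the weights end up in mempolicy, cgroups, or somewhere else.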
> I'm generally pretty wary of adding non-resource group configuration
> interfaces, especially when they don't have a counterpart in the regular
> per-process/thread API, for a few reasons:
>
> 1. The reason why people try to add those through cgroup sometimes is
>    because it seems easier to add those new features through cgroup,
>    which may be true to some degree, but shortcuts often aren't very
>    conducive to long term maintainability.
>

Concur.  That's why I originally proposed the mempolicy extension, since I
wasn't convinced by global settings, but I've been brought around by the
fact that migrations and hotplug events may want to effect mass changes
across a large number of unrelated tasks.

> 2. As noted above, just having cgroup often excludes a significant
>    portion of use cases. Not all systems enable cgroups, and programmatic
>    accesses from target processes / threads are coarse-grained and can be
>    really awkward.
>
> 3. Cgroup can be convenient when a group config change is necessary.
>    However, we really don't want to keep adding kernel interfaces just
>    for changing configs for a group of threads. For config changes which
>    aren't high frequency, userspace iterating the member processes and
>    applying the changes if possible is usually good enough, which usually
>    involves looping until no new process is found. If the looping is
>    problematic, the cgroup freezer can be used to atomically stop all
>    member threads to provide atomicity too.
>

If I can ask, do you think it would be out of line to propose a major
refactor of mempolicy that allows external tasks to change a running
task's mempolicy, *as well as* a cgroup-wide mempolicy component?  As
you've alluded to here, I don't think either mechanism on its own is
sufficient to handle all use cases, but the two combined do seem
sufficient.

I do appreciate the feedback here, thank you.  I think we are getting to
the bottom of how/where such new mempolicy mechanisms should be
implemented.  (A rough sketch of the freezer-based iteration you describe
in (3) follows in a P.S. below.)

Gregory
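P.S. Just to check my understanding of the iteration you describe in (3),
a minimal userspace sketch of the freezer-based variant (cgroup v2 paths;
the group path and apply_config() are placeholders, and with the freezer a
single pass over cgroup.procs should suffice since members can't fork
while frozen):

/*
 * Hypothetical sketch only: freeze a cgroup (v2), walk cgroup.procs,
 * apply some per-task change, then thaw.  The cgroup path and
 * apply_config() are stand-ins for illustration.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define CGROUP "/sys/fs/cgroup/mygroup"		/* hypothetical group */

static int write_freeze(const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.freeze", CGROUP);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

/* Stand-in for whatever per-task change is being applied. */
static void apply_config(pid_t pid)
{
	printf("would update pid %ld\n", (long)pid);
}

int main(void)
{
	char path[256], line[64];
	FILE *f;

	if (write_freeze("1"))			/* stop all member threads */
		return 1;

	snprintf(path, sizeof(path), "%s/cgroup.procs", CGROUP);
	f = fopen(path, "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			apply_config((pid_t)atol(line));
		fclose(f);
	}

	write_freeze("0");			/* thaw */
	return 0;
}

If that's roughly the intended pattern, then the missing piece for our use
case is only the per-task interface to apply the change with, which loops
back to the question above.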