Gregory Price <gregory.price@xxxxxxxxxxxx> writes:

> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one. Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> >
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> >
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>>
>> Right, our current cpuset implementation indeed requires the callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottleneck, though. If
>> anything, cpuset lock optimizations would be a win also for users who do
>> not want to use the weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until the cpusets lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications of the pull-model make me sad.
>
> I would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc.) before I go mucking around
> further.
>
> So far we have:
>
> * mempolicy: updating weights is a very complicated undertaking, and
>              there is no (good) way to do this from outside the task.
>              It would be better to have a coarser-grained control.
>
>              A new syscall is likely needed to add/set weights in the
>              per-task mempolicy, or bite the bullet on set_mempolicy2
>              and make the syscall extensible for the future.
>
> * memtiers: tier == node when devices are already interleaved or when
>             all devices are different, so why add yet another layer of
>             complexity if other constructs already exist? Additionally,
>             you lose task-placement-relative weighting (or it becomes
>             very complex to implement).

Because we usually have multiple nodes in one mem-tier, I still think a
mem-tier-based interface is simpler than a node-based one. But it seems
more complex to introduce mem-tiers into mempolicy, especially if we
have per-task weights. So, I am fine with going with a node-based
interface.

> * cgroups: "this doesn't involve dynamic resource accounting /
>            enforcement at all" and "these aren't resource
>            allocations, it's unclear what the hierarchical
>            relationship means".
>
> * node: too global; explore smaller scope first, then expand.

Why is it too global? I understand that it doesn't cover all possible
use cases (although I don't know whether those use cases are practical
or not). But it can provide a reasonable default per-node weight based
on the available node performance information (such as HMAT, CDAT,
etc.), and quite a few workloads can just use that. I think this is a
useful feature.

> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so
> I'll go ahead and put up another RFC with some options on that in the
> next week or so.
>
> The limitation of the first pass will be that only the task itself is
> capable of re-weighting should cpusets.mems or the nodemask change.

--
Best Regards,
Huang, Ying
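Whichever interface ends up holding the weights (per-task mempolicy, cpusets, or a per-node knob), the placement semantics being configured are the same: each node in the nodemask receives weight[node] consecutive pages per round-robin pass. A minimal userspace model of that behavior, with illustrative node IDs and weights (assumptions for the sketch, not values from this thread):

```python
# Model of weighted-interleave page placement. Node IDs and weights
# are illustrative; the kernel interface for configuring the weights
# is exactly what this thread is debating.
def weighted_interleave(weights, npages):
    """Return the node each of npages successive pages lands on.

    weights maps node ID -> number of consecutive pages placed on
    that node per round-robin pass over the nodemask.
    """
    placement = []
    while len(placement) < npages:
        for node, weight in weights.items():
            placement.extend([node] * weight)
    return placement[:npages]

# Example: DRAM node 0 weighted 3:1 over a slower CXL node 1.
print(weighted_interleave({0: 3, 1: 1}, 8))  # [0, 0, 0, 1, 0, 0, 0, 1]
```

With equal weights this degenerates to the classic MPOL_INTERLEAVE round-robin, which is why the debate above is only about where the weights live, not about the distribution mechanism itself.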