On Sat, Nov 11, 2023 at 03:54:55PM -0800, Dan Williams wrote:
> tj@xxxxxxxxxx wrote:
> > Hello,
> >
> > On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> > > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@xxxxxxxxxx wrote:
>
> > Here, even if CXL actually becomes popular, how many are going to use memory
> > hotplug and need to dynamically rebalance memory in actively running
> > workloads? What's the scenario? Are there going to be an army of data center
> > technicians going around plugging and unplugging CXL devices depending on
> > system memory usage?
>
> While I have personal skepticism that all of the infrastructure in the
> CXL specification is going to become popular, one mechanism that seems
> poised to cross that threshold is "dynamic capacity". So it is not the
> case that techs are running around hot-adjusting physical memory. A host
> will have a cable hop to a shared memory pool in the rack where it can
> be dynamically provisioned across hosts.
>
> However, even then the bounds of what is dynamic is going to be
> constrained to a fixed address space with likely predictable performance
> characteristics for that address range. That potentially allows for a
> system wide memory interleave policy to be viable. That might be the
> place to start and mirrors, at a coarser granularity, what hardware
> interleaving can do.
>
> [..]

Funny enough, this is exactly why I skipped cgroups and went directly to
implementing the weights as an attribute of numa nodes. It cuts out a
middle-man and lets you apply weights globally. BUT the policy is still
ultimately opt-in, so you don't really get a global effect, just a
global control.

Just given that lesson, yeah, it's better to reduce the scope to
mempolicy first. Getting to global interleave weights from there is...
more complicated.

The simplest way I can think of to test system-wide weighted interleave
is to have the init task create a default mempolicy and have all tasks
inherit it. That feels like a big, dumb hammer - but it might work.

Comparatively, implementing a mempolicy in the root cgroup and having
tasks use it directly "feels" better, though the lesson from this patch
is that iterating cgroup parent trees on allocations does not feel
great.

Barring that, if a cgroup.mempolicy and a default mempolicy for init
aren't realistic, I don't see a good path to fruition for a global
interleave approach that doesn't require nastier allocator changes.

In the meantime, unless there are other pro-cgroup voices, I'm going to
pivot back to my initial approach of doing it in mempolicy, though I may
explore extending mempolicy into procfs at the same time.

~Gregory
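
P.S. To make "global control vs. global effect" concrete, here is a rough
userspace sketch of the weighted round-robin node selection that per-node
weights imply at allocation time. It is not the kernel implementation from
this patch set; NR_NODES, the weights, and the il_state bookkeeping are
made up for illustration.

/*
 * Hypothetical sketch only: weighted round-robin node selection,
 * simulating how per-node interleave weights could be spent one
 * page at a time. NR_NODES and node_weight[] are invented values.
 */
#include <stdio.h>

#define NR_NODES 3

/* Example weights: two DRAM nodes weighted heavier than one CXL node. */
static const unsigned int node_weight[NR_NODES] = { 4, 4, 1 };

/* Per-task interleave state: current node and credits left on it. */
struct il_state {
	unsigned int cur_node;
	unsigned int credits;
};

/* Pick the node for the next allocation, spending one credit per page. */
static unsigned int weighted_interleave_next(struct il_state *st)
{
	if (st->credits == 0) {
		st->cur_node = (st->cur_node + 1) % NR_NODES;
		st->credits = node_weight[st->cur_node];
	}
	st->credits--;
	return st->cur_node;
}

int main(void)
{
	struct il_state st = { .cur_node = 0, .credits = node_weight[0] };
	unsigned int pages[NR_NODES] = { 0 };

	/* Simulate 9000 page allocations and count where they land. */
	for (int i = 0; i < 9000; i++)
		pages[weighted_interleave_next(&st)]++;

	for (int n = 0; n < NR_NODES; n++)
		printf("node %d: %u pages (weight %u)\n",
		       n, pages[n], node_weight[n]);

	return 0;
}

With 4:4:1 weights that lands 4000/4000/1000 pages across the three
nodes; the open question in the thread is only where that state lives -
per-task mempolicy, a cgroup, or a global default.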