On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote:
> > > OK, then let's stop any complicated solution right here then. Let's
> > > start simple with a per-mm flag to disable demotion of an address space.
> > > Should there ever be a real demand for a more fine grained solution
> > > let's go further but I do not think we want a half baked solution
> > > without real use cases.
> >
> > Yes, the concern about the high cost for mempolicy from you and Yang is
> > valid.
> >
> > How about the cpuset part?
>
> Cpusets fall into the same bucket as per task mempolicies wrt costs. Getting a
> cpuset requires knowing all tasks associated with a page. Or am I just
> missing any magic? And no, memcg->cpuset association is not a proper
> solution at all.

No, you are not missing anything. It's really difficult to find a
solution that covers all the holes. The patch is a best-effort
approach, trying to cover the cgroup v2 + memory controller enabled
case, which we think is a common use case for newer platforms with
tiering memory.

> > We've got bug reports from different channels about using cpuset+docker
> > to control memory placement on memory tiering systems, leading to 2
> > commits solving them:
> >
> > 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
> > cpuset_init_smp()")
> > https://lore.kernel.org/all/20220419020958.40419-1-feng.tang@xxxxxxxxx/
> >
> > 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
> > bail out early")
> > https://lore.kernel.org/all/1632481657-68112-1-git-send-email-feng.tang@xxxxxxxxx/
> >
> > From these bug reports, I think it's reasonable to say there are quite
> > a few real world users using cpuset+docker on memory tiering systems.
>
> I don't think anybody is questioning the existence of those use cases. The
> primary question is whether any of them really require any non-trivial
> (read nodemask aware) demotion policies. In other words do we know of
> cpuset policy setups where demotion fallbacks are (partially) excluded?

For cpuset NUMA memory binding, there are possible use cases:

* The user wants cpuset to bind some important containers to faster
  memory tiers for better latency/performance (where simply disabling
  demotion should work, like your per-mm flag solution).

* The user wants to bind to a set of physically closer nodes (like a
  faster CPU+DRAM node and its slower PMEM node). With the initial
  demotion code, our HW has a 1:1 demotion/promotion pairing between a
  DRAM node and its closest PMEM node, so the user's binding works fine.

There are also many other types of memory tiering systems from other
vendors, e.g. with many CPU-less DRAM nodes in the system, and Aneesh's
patchset [1] created a more general tiering interface, where IIUC each
tier has a nodemask and an upper tier can demote to the whole lower
tier, making the demotion path an N:N mapping. For that case, honoring
a fine-tuned cpuset node binding needs nodemask-aware handling.

[1]. https://lore.kernel.org/lkml/20220818131042.113280-1-aneesh.kumar@xxxxxxxxxxxxx/

Thanks,
Feng

> --
> Michal Hocko
> SUSE Labs
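
P.S. To make the two directions above concrete, here is a rough sketch
of what a demotion-target check could look like. This is illustrative
only, not the actual patch: MMF_NO_DEMOTE and may_demote_to_node() are
made-up names, while cpuset_mems_allowed() and node_isset() are existing
helpers. It also assumes the caller already has an owning task in hand,
which is exactly the rmap-walk cost Michal points out.

#include <linux/cpuset.h>
#include <linux/nodemask.h>
#include <linux/sched.h>

/*
 * Sketch only: MMF_NO_DEMOTE is a hypothetical mm flag, and resolving
 * an owning task for a page under reclaim is the expensive part, so a
 * task is simply assumed here.
 */
static bool may_demote_to_node(struct task_struct *tsk, int target_nid)
{
	nodemask_t allowed;

	/* Coarse per-mm opt-out, along the lines of the per-mm flag idea. */
	if (test_bit(MMF_NO_DEMOTE, &tsk->mm->flags))
		return false;

	/*
	 * Finer-grained check: only demote to nodes the task's cpuset
	 * allows, so a container bound to a tier stays in that tier.
	 */
	allowed = cpuset_mems_allowed(tsk);
	return node_isset(target_nid, allowed);
}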