On Tue 27-09-22 21:07:02, Abel Wu wrote:
> On 9/27/22 6:49 PM, Michal Hocko wrote:
> > On Tue 27-09-22 11:20:54, Abel Wu wrote:
> > [...]
> > > > > Btw.in order to add per-thread-group mempolicy, is it possible to add
> > > > > mempolicy in mm_struct?
> > > >
> > > > I dunno. This would make the mempolicy interface even more confusing.
> > > > Per mm behavior makes a lot of sense but we already do have per-thread
> > > > semantic so I would stick to it rather than introducing a new semantic.
> > > >
> > > > Why is this really important?
> > >
> > > We want soft control on memory footprint of background jobs by applying
> > > NUMA preferences when necessary, so the impact on different NUMA nodes
> > > can be managed to some extent. These NUMA preferences are given by the
> > > control panel, and it might not be suitable to overwrite the tasks with
> > > specific memory policies already (or vice versa).
> >
> > Maybe the answer is somehow implicit but I do not really see any
> > argument for the per thread-group semantic here. In other words why a
> > new interface has to cover more than the local [sg]et_mempolicy?
> > I can see convenience as one potential argument. Also if there is a
> > requirement to change the policy in atomic way then this would require a
> > single syscall.
>
> Convenience is not our major concern. A well-tuned workload can have
> specific memory policies for different tasks/vmas in one process, and
> this can be achieved by set_mempolicy()/mbind() respectively. While
> other workloads are not, they don't care where the memory residents,
> so the impact they brought on the co-located workloads might vary in
> different NUMA nodes.
>
> The control panel, which has a full knowledge of workload profiling,
> may want to interfere the behavior of the non-mempolicied processes
> by giving them NUMA preferences, to better serve the co-located jobs.
>
> So in this scenario, a process's memory policy can be assigned by two
> objects dynamically:
>
>  a) the process itself, through set_mempolicy()/mbind()
>  b) the control panel, but API is not available right now
>
> Considering the two policies should not fight each other, it sounds
> reasonable to introduce a new syscall to assign memory policy to a
> process through struct mm_struct.

So you want to allow restoring the original local policy if the external
one is disabled?

Anyway, pidfd_$FOO behavior should be semantically very similar to the
original $FOO. Moving from per-task to per-mm is a major shift in the
semantic. I can imagine having a dedicated flag for the syscall to
enforce the policy on the full thread group. But having a different
semantic is both tricky and also constrained because per-thread binding
is then impossible.
-- 
Michal Hocko
SUSE Labs