Re: [External] Re: [RFC] proc: Add a new isolated /proc/pid/mempolicy type.

Michal Hocko <mhocko@xxxxxxxx> · Tue, 27 Sep 2022 15:58:52 +0200

On Tue 27-09-22 21:07:02, Abel Wu wrote:
> On 9/27/22 6:49 PM, Michal Hocko wrote:
> > On Tue 27-09-22 11:20:54, Abel Wu wrote:
> > [...]
> > > > > Btw, in order to add per-thread-group mempolicy, is it possible to add
> > > > > mempolicy in mm_struct?
> > > > 
> > > > I dunno. This would make the mempolicy interface even more confusing.
> > > > Per mm behavior makes a lot of sense but we already do have per-thread
> > > > semantic so I would stick to it rather than introducing a new semantic.
> > > > 
> > > > Why is this really important?
> > > 
> > > We want soft control on memory footprint of background jobs by applying
> > > NUMA preferences when necessary, so the impact on different NUMA nodes
> > > can be managed to some extent. These NUMA preferences are given by the
> > > control panel, and it might not be suitable to overwrite the tasks with
> > > specific memory policies already (or vice versa).
> > 
> > Maybe the answer is somehow implicit but I do not really see any
> > argument for the per thread-group semantic here. In other words why a
> > new interface has to cover more than the local [sg]et_mempolicy?
> > I can see convenience as one potential argument. Also if there is a
> > requirement to change the policy in atomic way then this would require a
> > single syscall.
> 
> Convenience is not our major concern. A well-tuned workload can have
> specific memory policies for different tasks/vmas in one process, and
> this can be achieved by set_mempolicy()/mbind() respectively. While
> other workloads are not, they don't care where the memory resides,
> so the impact they brought on the co-located workloads might vary in
> different NUMA nodes.
> 
> The control panel, which has a full knowledge of workload profiling,
> may want to interfere with the behavior of the non-mempolicied processes
> by giving them NUMA preferences, to better serve the co-located jobs.
> 
> So in this scenario, a process's memory policy can be assigned by two
> objects dynamically:
> 
>  a) the process itself, through set_mempolicy()/mbind()
>  b) the control panel, but the API is not available right now
> 
> Considering the two policies should not fight each other, it sounds
> reasonable to introduce a new syscall to assign memory policy to a
> process through struct mm_struct.

So you want to allow restoring the original local policy if the external
one is disabled?

Anyway, pidfd_$FOO behavior should be semantically very similar to the
original $FOO. Moving from per-task to per-mm is a major shift in the
semantic. I can imagine having a dedicated flag for the syscall to
enforce the policy on the full thread group. But having a different
semantic is both tricky and also constrained because per-thread binding
is then impossible.
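
To make the flag idea concrete, here is a toy userspace model, not kernel
code and not a proposed ABI. Every name in it (model_group,
model_set_policy, MODEL_F_THREAD_GROUP, the policy enum) is made up for
illustration. It only shows that keeping one policy slot per thread and
adding an opt-in "whole thread group" flag still leaves per-thread
binding possible afterwards, which a single per-mm slot would not.

```c
/* Toy userspace model of the semantics; all names are hypothetical. */
#include <assert.h>
#include <stddef.h>

enum model_policy { MODEL_DEFAULT, MODEL_PREFERRED, MODEL_BIND };

#define MODEL_MAX_THREADS    4
#define MODEL_F_THREAD_GROUP 0x1u /* hypothetical "apply to all threads" flag */

struct model_group {
	enum model_policy policy[MODEL_MAX_THREADS]; /* one slot per thread */
	size_t nr_threads;
};

/*
 * Set the policy of thread @tid only (flags == 0), or of every thread in
 * the group when MODEL_F_THREAD_GROUP is passed.
 */
static void model_set_policy(struct model_group *g, size_t tid,
			     enum model_policy pol, unsigned int flags)
{
	assert(g->nr_threads <= MODEL_MAX_THREADS);
	assert(tid < g->nr_threads);

	if (flags & MODEL_F_THREAD_GROUP) {
		for (size_t i = 0; i < g->nr_threads; i++)
			g->policy[i] = pol;
		return;
	}
	g->policy[tid] = pol;
}

/* Read back the policy of a single thread. */
static enum model_policy model_get_policy(const struct model_group *g,
					  size_t tid)
{
	assert(tid < g->nr_threads);
	return g->policy[tid];
}
```

In this model a group-wide call overwrites whatever the individual
threads had set, so it says nothing about restoring a previous local
policy; that would need extra state which is deliberately left out.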
-- 
Michal Hocko
SUSE Labs