On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> Michal Hocko <mhocko@xxxxxxxx> writes:
> 
> > On Thu 27-10-22 15:39:00, Huang, Ying wrote:
> >> Michal Hocko <mhocko@xxxxxxxx> writes:
> >> 
> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> >> >> Michal Hocko <mhocko@xxxxxxxx> writes:
> >> > [...]
> >> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> >> > for some reason, but wouldn't it be more practical to tell that
> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> >> > policies explicitly?
> >> >> 
> >> >> If my understanding is correct, prctl() configures the process or
> >> >> thread.
> >> > 
> >> > Not necessarily. There are properties which are per address space, like
> >> > PR_[GS]ET_THP_DISABLE. This could be very similar.
> >> > 
> >> >> How can we get the process/thread configuration at demotion time?
> >> > 
> >> > As already pointed out in previous emails, you could hook into the
> >> > folio_check_references path, more specifically folio_referenced_one,
> >> > where you have all that you need already - all vmas mapping the page -
> >> > and then it is trivial to get the corresponding vm_mm. If at least one
> >> > of them has the flag set then the demotion is not allowed (essentially
> >> > the same model as VM_LOCKED).
> >> 
> >> Got it! Thanks for the detailed explanation.
> >> 
> >> One bit may not be sufficient. For example, if we want to avoid or
> >> control cross-socket demotion and still allow demoting to slow memory
> >> nodes in the local socket, we need to specify a node mask to exclude
> >> some NUMA nodes from demotion targets.
> > 
> > Isn't this something to be configured on the demotion topology side? Or
> > do you expect there will be per process/address space usecases? I mean
> > different processes running on the same topology, one requesting local
> > demotion while the other is ok with the whole demotion topology?
> 
> I think that it's possible for different processes to have different
> requirements.
> 
> - Some processes don't care about where the memory is placed: prefer
>   local, then fall back to remote if there is no free space.
> 
> - Some processes want to avoid cross-socket traffic: bind to nodes of
>   the local socket.
> 
> - Some processes want to avoid using slow memory: bind to fast memory
>   nodes only.

Yes, I do understand that. Do you have any specific examples in mind?

[...]

> > If we really need/want to give fine-grained control over the demotion
> > nodemask then we would have to go with the vma->mempolicy interface. In
> > any case a per-process on/off knob sounds like a reasonable first step
> > before we learn more about real usecases.
> 
> Yes. A per-mm or per-vma property is much better than a per-task
> property. Another possibility: how about adding a new flag to the
> set_mempolicy() system call to set the per-mm mempolicy? `numactl` can
> use that by default.

Do you mean a flag to control whether the given policy is applied to a
task or an mm?
-- 
Michal Hocko
SUSE Labs
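
[Editor's illustration: a minimal user-space sketch of the prctl-style
per-address-space opt-out discussed above. PR_SET_MEMORY_DEMOTION and
PR_MEMORY_DEMOTION_DISABLE are hypothetical names and values made up for
this example, modeled on PR_SET_THP_DISABLE; no such interface exists
upstream, so on current kernels the call simply fails with EINVAL.]

    /*
     * Sketch only: PR_SET_MEMORY_DEMOTION and PR_MEMORY_DEMOTION_DISABLE
     * are hypothetical; they are not part of the upstream prctl()
     * interface.  The model mirrors PR_SET_THP_DISABLE: a flag on the mm,
     * consulted when reclaim considers demoting a page.
     */
    #include <stdio.h>
    #include <sys/prctl.h>

    #define PR_SET_MEMORY_DEMOTION     100   /* hypothetical option */
    #define PR_MEMORY_DEMOTION_DISABLE   1   /* hypothetical argument */

    int main(void)
    {
            /*
             * Ask the kernel not to demote this address space's pages to
             * slower memory tiers.  On current kernels the unknown option
             * is rejected with EINVAL.
             */
            if (prctl(PR_SET_MEMORY_DEMOTION, PR_MEMORY_DEMOTION_DISABLE,
                      0, 0, 0))
                    perror("prctl(PR_SET_MEMORY_DEMOTION)");

            /* ... run the demotion-sensitive workload ... */

            return 0;
    }

On the kernel side, the corresponding per-mm flag would then be tested via
vma->vm_mm from folio_referenced_one(), in the same way the VM_LOCKED model
described above keeps pages away from reclaim/demotion.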