Michal Hocko <mhocko@xxxxxxxx> writes:

> On Thu 27-10-22 17:31:35, Huang, Ying wrote:
>> Michal Hocko <mhocko@xxxxxxxx> writes:
>>
>> > On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@xxxxxxxx> writes:
>> >>
>> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> >> >> Michal Hocko <mhocko@xxxxxxxx> writes:
>> >> > [...]
>> >> >> > I can imagine workloads which wouldn't like to get their memory demoted
>> >> >> > for some reason but wouldn't it be more practical to tell that
>> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> >> > policies explicitly?
>> >> >>
>> >> >> If my understanding is correct, prctl() configures the process or
>> >> >> thread.
>> >> >
>> >> > Not necessarily. There are properties which are per address space like
>> >> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>> >> >
>> >> >> How can we get the process/thread configuration at demotion time?
>> >> >
>> >> > As already pointed out in previous emails, you could hook into the
>> >> > folio_check_references path, more specifically folio_referenced_one,
>> >> > where you have all that you need already - all vmas mapping the page -
>> >> > and then it is trivial to get the corresponding vm_mm. If at least one
>> >> > of them has the flag set then the demotion is not allowed (essentially
>> >> > the same model as VM_LOCKED).
>> >>
>> >> Got it! Thanks for the detailed explanation.
>> >>
>> >> One bit may not be sufficient. For example, if we want to avoid or
>> >> control cross-socket demotion and still allow demoting to slow memory
>> >> nodes in the local socket, we need to specify a node mask to exclude
>> >> some NUMA nodes from the demotion targets.
>> >
>> > Isn't this something to be configured on the demotion topology side? Or
>> > do you expect there will be per process/address space usecases? I mean
>> > different processes running on the same topology, one requesting local
>> > demotion while the other is OK with the whole demotion topology?
>>
>> I think that it's possible for different processes to have different
>> requirements.
>>
>> - Some processes don't care about where the memory is placed: prefer
>>   local, then fall back to remote if there is no free space.
>>
>> - Some processes want to avoid cross-socket traffic: bind to the nodes
>>   of the local socket.
>>
>> - Some processes want to avoid using slow memory: bind to fast memory
>>   nodes only.
>
> Yes, I do understand that. Do you have any specific examples in mind?
> [...]

Sorry, I don't have specific examples.

>> > If we really need/want to give fine grained control over the demotion
>> > nodemask then we would have to go with the vma->mempolicy interface. In
>> > any case a per process on/off knob sounds like a reasonable first step
>> > before we learn more about real usecases.
>>
>> Yes. A per-mm or per-vma property is much better than a per-task property.
>> Another possibility: how about adding a new flag to the set_mempolicy()
>> system call to set a per-mm mempolicy? `numactl` could use that by default.
>
> Do you mean a flag to control whether the given policy is applied to a
> task or mm?

Yes. That is the idea.

Best Regards,
Huang, Ying
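
To make the per-mm flag idea discussed above a little more concrete, here is
a rough sketch only, not a patch: MMF_NO_DEMOTION is a made-up mm flag
standing in for a per-address-space "do not demote" property (set, say, via
a new prctl() command in the spirit of PR_SET_THP_DISABLE / MMF_DISABLE_THP).
A helper like this could be called from folio_referenced_one() for every vma
that maps the folio, much like the existing VM_LOCKED handling:

    /*
     * Sketch only: MMF_NO_DEMOTION does not exist today.  It stands in
     * for a hypothetical per-mm "don't demote my pages" bit, set from
     * userspace via a new prctl() command, analogous to how
     * PR_SET_THP_DISABLE sets MMF_DISABLE_THP on the whole address space.
     */
    static bool vma_forbids_demotion(struct vm_area_struct *vma)
    {
            struct mm_struct *mm = vma->vm_mm;

            /* The rmap walk can hand us vmas from several processes. */
            if (!mm)
                    return false;

            /* One opted-out mapping is enough to keep the folio in place. */
            return test_bit(MMF_NO_DEMOTION, &mm->flags);
    }

folio_referenced_one() would then report the result back through
folio_check_references() so that reclaim keeps the folio instead of demoting
it, in the same spirit as the VM_LOCKED model mentioned above.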
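
Similarly, the set_mempolicy() idea at the end might look roughly like the
following from userspace. MPOL_F_MM is invented here purely for illustration
(the bit value is arbitrary); today set_mempolicy(2) installs a per-task
policy only, so this is what the proposed interface could look like, not
what exists:

    #include <numaif.h>     /* set_mempolicy(), MPOL_BIND; link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>

    #define MPOL_F_MM  (1 << 12)  /* hypothetical "apply to whole mm" flag */

    int main(void)
    {
            /*
             * Bind to nodes 0-1, e.g. the fast (DRAM) nodes of the local
             * socket, so that reclaim could not demote this process's
             * pages to slower or remote nodes.
             */
            unsigned long nodemask = (1UL << 0) | (1UL << 1);

            if (set_mempolicy(MPOL_BIND | MPOL_F_MM, &nodemask,
                              8 * sizeof(nodemask)) < 0) {
                    perror("set_mempolicy");
                    exit(EXIT_FAILURE);
            }

            /*
             * A numactl-style launcher would do this before exec()ing the
             * workload, making the policy the default for the whole
             * process rather than just the calling thread.
             */
            return 0;
    }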