Sorry, I saw this after the other email.

On Fri, 25 Oct 2024, Peter Zijlstra wrote:

> > Could we follow NUMA policies like with other metadata allocations during
> > system call processing?
>
> I had a quick look at this, and since the mempolicy stuff is per vma,
> and we don't have the vma, this is going to be terribly expensive --
> mmap_lock and all that.

There is a memory policy for the task as a whole, current->mempolicy,
that is used for slab allocations and for allocations that are not vma
bound. Use that.

> Using memory policies is probably okay -- but still risky, since you get
> the extra failure case where if you change the mempolicy between WAIT
> and WAKE things will not match and sadness happens, but that *SHOULD*
> hopefully not happen a lot. Mempolicies are typically fairly static.

Right.

> > That way the placement of the futex can be controlled by the tasks memory
> > policy. We could skip the FUTEX2_NUMA option.
>
> That doesn't work. If we don't have storage for the node across
> WAIT/WAKE, then the node must be deterministic per futex_hash().
> Otherwise wake has no chance of finding the entry.

You can get a node number that follows the current task's mempolicy by
calling mempolicy_slab_node(), and then keep using that node from then
on. It is also possible to check whether the policy is interleave and,
if so, follow the distributed hash scheme.

> The current scheme where we determine node based on hash bits is fully
> deterministic and WAIT/WAKE will agree on which node-hash to use. The
> interleave is no worse than the global hash today -- OTOH it also isn't
> better.

This is unexpected and strange behavior for anyone familiar with NUMA.
We have tools to set memory policies for tasks, and those policies
should be used throughout.
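
For illustration only, here is a minimal user-space sketch of the node
selection suggested above. It is not kernel code. The names
futex_pick_node(), struct task_policy and the POLICY_* modes are
hypothetical stand-ins. In the kernel, the non-interleave result would
come from mempolicy_slab_node() reading current->mempolicy. The caveat
from the thread still applies: if the policy (or, with no policy, the
local node) differs between WAIT and WAKE, the two sides pick different
nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's mempolicy modes. */
enum policy_mode {
	POLICY_DEFAULT,		/* no task policy: use the local node */
	POLICY_PREFERRED,	/* task policy names a single node */
	POLICY_INTERLEAVE,	/* task policy spreads over all nodes */
};

struct task_policy {
	enum policy_mode mode;
	int preferred_node;	/* only meaningful for POLICY_PREFERRED */
};

/*
 * Pick the node whose futex hash table should hold this futex.
 *
 * Interleave keeps the deterministic "node from hash bits" scheme, so
 * WAIT and WAKE agree without stored state. Any other policy yields a
 * single node taken from the task policy.
 */
static int futex_pick_node(const struct task_policy *pol, uint32_t hash,
			   int local_node, int nr_nodes)
{
	assert(nr_nodes > 0);

	switch (pol->mode) {
	case POLICY_INTERLEAVE:
		return (int)(hash % (uint32_t)nr_nodes);
	case POLICY_PREFERRED:
		return pol->preferred_node;
	default:
		return local_node;
	}
}
```

With four nodes, a task bound to node 2 always gets node 2 regardless
of the hash, while an interleaved task with hash 7 lands on node 3.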