On Fri, Oct 25, 2024 at 12:28:54PM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 25 Oct 2024, Peter Zijlstra wrote:
>
> > Extend the futex2 interface to be numa aware.
> >
> > When FUTEX2_NUMA is specified for a futex, the user value is extended
> > to two words (of the same size). The first is the user value we all
> > know, the second one will be the node to place this futex on.
> >
> >   struct futex_numa_32 {
> >         u32 val;
> >         u32 node;
> >   };
> >
> > When node is set to ~0, WAIT will set it to the current node_id such
> > that WAKE knows where to find it. If userspace corrupts the node value
> > between WAIT and WAKE, the futex will not be found and no wakeup will
> > happen.
> >
> > When FUTEX2_NUMA is not set, the node is simply an extension of the
> > hash, such that traditional futexes are still interleaved over the
> > nodes.
> >
> Would it be possible to follow the NUMA memory policy set up for a task
> when making these decisions? We may not need a separate FUTEX2_NUMA
> option. There are supportive functions in mm/mempolicy.c that will yield
> a node for the futex logic to use.

Using get_task_policy() seems very dangerous to me. It is explicitly
possible for different tasks in a process to have different policies,
which means (private) futexes would fail to work correctly: the waiter
and the waker could each derive a different node for the same futex.

We need something that is consistent process-wide -- like the vma
policies. Except at present, those are too expensive to readily access.
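
To make the failure mode concrete, here is a toy userspace model, not
kernel code. Only struct futex_numa_32 and the ~0 convention come from
the description quoted above; struct toy_task, the TOY_* constants and
all helper names are hypothetical, and the real lookup involves hashing
that is left out here. It only shows that a node derived per task can
differ between waiter and waker, while a node stored next to the futex
value cannot (as long as userspace leaves it alone):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_NODES 4u               /* hypothetical node count */
#define TOY_NO_NODE  ((uint32_t)~0u)  /* the "~0" from the quoted text */

/* Layout from the quoted description, spelled with stdint types. */
struct futex_numa_32 {
	uint32_t val;
	uint32_t node;
};

/* Hypothetical stand-in for a task that carries its own NUMA policy. */
struct toy_task {
	uint32_t policy_node;   /* node this task's policy would yield */
	uint32_t running_node;  /* node the task currently runs on */
};

/* Variant A: derive the node from the calling task's policy. */
static uint32_t node_from_task_policy(const struct toy_task *t)
{
	assert(t->policy_node < TOY_NR_NODES);
	return t->policy_node;
}

/*
 * Variant B: the node lives next to the futex value. The first caller
 * (the waiter in this model) replaces ~0 with its current node; every
 * later caller reads back what was stored.
 */
static uint32_t node_from_futex_word(struct futex_numa_32 *f,
				     const struct toy_task *t)
{
	assert(t->running_node < TOY_NR_NODES);
	if (f->node == TOY_NO_NODE)
		f->node = t->running_node;
	return f->node;
}

/* True if waiter and waker would search the same per-node table. */
static bool policy_lookup_matches(const struct toy_task *waiter,
				  const struct toy_task *waker)
{
	return node_from_task_policy(waiter) == node_from_task_policy(waker);
}

static bool futex_word_lookup_matches(struct futex_numa_32 *f,
				      const struct toy_task *waiter,
				      const struct toy_task *waker)
{
	uint32_t wait_node = node_from_futex_word(f, waiter); /* WAIT side */
	uint32_t wake_node = node_from_futex_word(f, waker);  /* WAKE side */

	return wait_node == wake_node;
}
```

With a waiter whose policy yields node 0 and a waker whose policy
yields node 2, policy_lookup_matches() returns false, which is the lost
wakeup; futex_word_lookup_matches() returns true for the same pair
regardless of where either task runs.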