Re: [PATCH v1 11/14] futex: Implement FUTEX2_NUMA

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Fri, 25 Oct 2024 10:58:15 +0200

On Wed, Jun 12, 2024 at 10:23:00AM -0700, Christoph Lameter (Ampere) wrote:

> > When FUTEX2_NUMA is not set, the node is simply an extension of the
> > hash, such that traditional futexes are still interleaved over the
> > nodes.
> 
> Could we follow NUMA policies like with other metadata allocations during
> system call processing?

I had a quick look at this, and since the mempolicy stuff is per vma,
and we don't have the vma, this is going to be terribly expensive --
mmap_lock and all that.

Once lockless vma lookups land (soonish, perhaps), this could be
reconsidered. But for now there just isn't a sane way to do this.

Using memory policies is probably okay -- but still risky, since you get
the extra failure case where if you change the mempolicy between WAIT
and WAKE things will not match and sadness happens, but that *SHOULD*
hopefully not happen a lot. Mempolicies are typically fairly static.

> If there is no NUMA task policy then the futex
> should be placed on the local NUMA node.

> That way the placement of the futex can be controlled by the task's memory
> policy. We could skip the FUTEX2_NUMA option.

That doesn't work. If we don't have storage for the node across
WAIT/WAKE, then the node must be deterministic per futex_hash().
Otherwise wake has no chance of finding the entry.

Consider our random unbound task with no policies etc. (default state)
doing FUTEX_WAIT and going to sleep while on node-0; its sibling
thread, which happens to run on node-1, issues FUTEX_WAKE.

If they disagree on determining 'node', then they will not find a match
and the wakeup doesn't happen and userspace gets really sad.

The current scheme where we determine node based on hash bits is fully
deterministic and WAIT/WAKE will agree on which node-hash to use. The
interleave is no worse than the global hash today -- OTOH it also isn't
better.
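
To make the WAIT/WAKE argument concrete, here is a minimal userspace
sketch, not the kernel code. Everything in it is hypothetical and made
up for illustration: toy_hash(), the two-node layout, the per-node
bucket count and the struct slot type. It only contrasts deriving the
node from hash bits with deriving it from wherever the caller happens
to be running.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES          2u   /* hypothetical: two NUMA nodes */
#define BUCKETS_PER_NODE 16u   /* hypothetical: per-node hash size */

/* Which per-node hash, and which bucket inside it, a futex key lands in. */
struct slot {
	unsigned int node;
	unsigned int bucket;
};

/* Stand-in for the real key hash; any deterministic mixer will do here. */
static uint32_t toy_hash(uint64_t key)
{
	key ^= key >> 33;
	key *= 0xff51afd7ed558ccdULL;
	key ^= key >> 33;
	return (uint32_t)key;
}

/*
 * Node taken from hash bits: depends only on the key, so every caller
 * computes the same slot no matter which node it runs on.
 */
static struct slot slot_from_hash(uint64_t key)
{
	uint32_t h = toy_hash(key);
	struct slot s = {
		.node   = (h / BUCKETS_PER_NODE) % NR_NODES,
		.bucket = h % BUCKETS_PER_NODE,
	};
	return s;
}

/*
 * Node taken from the caller's current node: depends on where the
 * thread runs, so waiter and waker can end up in different hashes.
 */
static struct slot slot_from_local_node(uint64_t key, unsigned int cur_node)
{
	assert(cur_node < NR_NODES);
	struct slot s = {
		.node   = cur_node,
		.bucket = toy_hash(key) % BUCKETS_PER_NODE,
	};
	return s;
}

static int same_slot(struct slot a, struct slot b)
{
	return a.node == b.node && a.bucket == b.bucket;
}
```

With the same key, slot_from_hash() gives the waiter and the waker the
same slot by construction. slot_from_local_node(key, 0) and
slot_from_local_node(key, 1) differ in the node field, which is exactly
the case where the waker looks in the wrong hash and never finds the
waiter.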