Re: [PATCH v1 11/14] futex: Implement FUTEX2_NUMA

"Christoph Lameter (Ampere)" <cl@xxxxxxxxxx> · Fri, 25 Oct 2024 12:36:28 -0700 (PDT)

Sorry, I saw this after the other email.

On Fri, 25 Oct 2024, Peter Zijlstra wrote:

> > Could we follow NUMA policies like with other metadata allocations during
> > system call processing?
>
> I had a quick look at this, and since the mempolicy stuff is per vma,
> and we don't have the vma, this is going to be terribly expensive --
> mmap_lock and all that.

There is a task-wide memory policy in current->mempolicy. It is used for
slab allocations and for other allocations that are not bound to a vma.
Use that.

> Using memory policies is probably okay -- but still risky, since you get
> the extra failure case where if you change the mempolicy between WAIT
> and WAKE things will not match and sadness happens, but that *SHOULD*
> hopefully not happen a lot. Mempolicies are typically fairly static.

Right.

> > That way the placement of the futex can be controlled by the task's memory
> > policy. We could skip the FUTEX2_NUMA option.
>
> That doesn't work. If we don't have storage for the node across
> WAIT/WAKE, then the node must be deterministic per futex_hash().
> Otherwise wake has no chance of finding the entry.

You can get a node number that follows the current task mempolicy by
calling mempolicy_slab_node(), and keep using that node from then on.

It is also possible to check if the policy is interleave and then follow
the distributed hash scheme.
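
A rough userspace model of what I mean, not kernel code. Everything here
is made up for illustration: struct task_policy stands in for
current->mempolicy, policy_node() stands in for whatever
mempolicy_slab_node() would return, and struct futex_model assumes there
is some per-futex slot that remembers the node. Where that slot would
actually live is not settled in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define NODE_UNSET (-1)

/* Hypothetical policy modes, loosely mirroring MPOL_* in the kernel. */
enum policy_mode { POLICY_DEFAULT, POLICY_PREFERRED, POLICY_INTERLEAVE };

/* Hypothetical stand-in for the task-wide policy (current->mempolicy). */
struct task_policy {
	enum policy_mode mode;
	int preferred_node;	/* used when mode == POLICY_PREFERRED */
	int local_node;		/* node the task is running on */
};

/* Hypothetical per-futex state: key hash plus the remembered node. */
struct futex_model {
	uint32_t hash;
	int node;		/* NODE_UNSET until the first lookup */
};

/* Models the node a policy lookup would return for non-interleave modes. */
static int policy_node(const struct task_policy *pol)
{
	if (pol->mode == POLICY_PREFERRED)
		return pol->preferred_node;
	return pol->local_node;
}

/*
 * Pick the node for a futex. Interleave falls back to the hash-derived
 * node; otherwise the node is taken from the policy once and then reused.
 */
static int futex_node(struct futex_model *f, const struct task_policy *pol,
		      unsigned int nr_nodes)
{
	assert(nr_nodes > 0);

	if (pol->mode == POLICY_INTERLEAVE)
		return (int)(f->hash % nr_nodes);

	if (f->node == NODE_UNSET)
		f->node = policy_node(pol);
	return f->node;
}
```

In this model a policy change between WAIT and WAKE does not move the
futex, because the second lookup reuses the stored node. A switch to or
from interleave in between would still mismatch.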

> The current scheme where we determine node based on hash bits is fully
> deterministic and WAIT/WAKE will agree on which node-hash to use. The
> interleave is no worse than the global hash today -- OTOH it also isn't
> better.
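
For comparison, the hash-bit scheme described above can be modeled
roughly as follows. This is a userspace sketch, not the code from the
patch: HASH_SHIFT, struct slot and hash_to_slot() are hypothetical names,
and the split of the hash into "bucket bits" and "node bits" is an
assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SHIFT 8	/* assumed: low 8 bits index the per-node table */

/* Hypothetical result: which node's table, and which bucket in it. */
struct slot {
	unsigned int node;
	unsigned int bucket;
};

/*
 * Derive node and bucket from the hash alone. No task state is
 * consulted, so WAIT and WAKE on the same key always agree.
 */
static struct slot hash_to_slot(uint32_t hash, unsigned int nr_nodes)
{
	struct slot s;

	assert(nr_nodes > 0);
	s.node = (hash >> HASH_SHIFT) % nr_nodes;
	s.bucket = hash & ((1u << HASH_SHIFT) - 1);
	return s;
}
```

The result depends only on the key, which is what makes it
deterministic. It is also why the placement ignores whatever policy the
task has set.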

This is unexpected, strange behavior for those familiar with NUMA. We have
tools to set memory policies for tasks, and those policies should be used
throughout.