Re: [PATCH v4 00/15] Add futex2 syscalls

André Almeida <andrealmeid@xxxxxxxxxxxxx> · Tue, 8 Jun 2021 12:04:18 -0300

At 11:23 on 08/06/21, Peter Zijlstra wrote:
> On Tue, Jun 08, 2021 at 02:26:22PM +0200, Sebastian Andrzej Siewior wrote:
>> On 2021-06-07 12:40:54 [-0300], André Almeida wrote:
>>>
>>> When I first read Thomas's proposal for a per-process table, I thought that
>>> the main goal there was to solve NUMA locality issues, not RT latency,
>>> but I think you are right. However, re-reading the thread at [0], it
>>> seems that the RT problems were not completely solved in that
>>> interface; maybe the people involved with that patchset can help shed
>>> some light on it.
>>>
>>> Otherwise, this same proposal could be integrated in futex2, given that
>>> we would only need to provide to userland some extra flags and add some
>>> `if`s around the hash table code (much as the NUMA code
>>> will be implemented in futex2).
>>
>> There are slides at [0] describing some attempts and the kernel tree [1]
>> from that time.
>>
>> The process-table solves the problem to the degree that two random
>> processes don't collide on the same hash bucket. But as Peter Zijlstra
>> pointed out back then, two threads from the same task could still collide
>> on the same hash bucket (and with ASLR not always). So the collision is
>> there but limited, and this was not perfect.
>>
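
To make the same-task collision concrete, here is a toy sketch, not kernel
code. The bucket count and the multiplicative hash are assumptions chosen for
illustration (the kernel hashes the futex key with jhash2() into a much larger
table), and the names bucket_index() and colliding_addresses() are
hypothetical. It only shows that a per-process table still puts several futex
words of one process into the same bucket once there are more words than
buckets.

```python
from collections import defaultdict

NUM_BUCKETS = 16  # hypothetical and deliberately tiny


def bucket_index(uaddr: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user address to a bucket with a stand-in multiplicative hash."""
    word = uaddr >> 2  # futex words are 4-byte aligned
    return ((word * 2654435761) & 0xFFFFFFFF) % num_buckets


def colliding_addresses(uaddrs):
    """Group addresses by bucket and keep only buckets with more than one."""
    buckets = defaultdict(list)
    for uaddr in uaddrs:
        buckets[bucket_index(uaddr)].append(uaddr)
    return {b: addrs for b, addrs in buckets.items() if len(addrs) > 1}


# One process using NUM_BUCKETS + 1 distinct futex words: by the pigeonhole
# principle at least two of them must land in the same bucket.
base = 0x7F0000001000  # assumed mapping address, would move under ASLR
same_process = [base + 4 * i for i in range(NUM_BUCKETS + 1)]
collisions = colliding_addresses(same_process)
```

Which pairs collide depends on the addresses themselves, which is why the
collision does not have to show up on every run when ASLR is enabled.
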
>> All the attempts with API extensions didn't go well because glibc did
>> not want to change a bit. This starts with a mutex that has a static
>> initializer which has to work (I don't remember why the first
>> pthread_mutex_lock() could not fail with -ENOMEM but there was
>> something) and ends with glibc's struct mutex which is full and has no
>> room for additional data storage.
>>
>> The additional data in user's struct mutex + init would have the benefit
>> that instead of uaddr (which is hashed for the in-kernel lookup) a cookie
>> could be used for the hash-less lookup (plus a NUMA pointer to where the
>> memory should be stored).
>>
>> So. We couldn't change a thing back then so nothing did happen. We
>> didn't want to create a new interface and a library implementing it plus
>> all the functionality around it (like pthread_cond, pthread_barrier, …).
>> Not to mention that if glibc continues to use the "old" locking
>> internally then the application is still affected by the hash-collision
>> locking (or the NUMA problem) should it block on the lock.
> 
> There are more futex users than glibc, and some of them are really hurting
> because of the NUMA issue. Oracle used to (I've no idea what they do or
> do not do these days) use sysvsem because the futex hash table was a
> massive bottleneck for them.
> 
> And as Nick said, other vendors are having the same problems.

Since we're talking about NUMA, which userspace communities would be
able to provide feedback about the futex2() NUMA-aware feature, to check
whether this interface would help solve those issues?

> 
> And if you don't extend the futex to store the nid you put the waiter in
> (see all the problems above) you will have to do wakeups on all nodes,
> which is both slower than it is today, and scales possibly even worse.
> 
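
A rough model of the trade-off described above, as I understand it. Everything
here is hypothetical (the NodeTables class, the node count, the lookup
counter) and it is not how futex2 is implemented. It only contrasts a wake
that knows the waiter's nid with one that has to visit every node's table.

```python
NUM_NODES = 4  # hypothetical node count


class NodeTables:
    """Toy per-node waiter tables keyed by user address."""

    def __init__(self, num_nodes: int = NUM_NODES):
        self.tables = [dict() for _ in range(num_nodes)]
        self.lookups = 0  # per-node tables inspected by wake calls so far

    def wait(self, uaddr: int, nid: int, waiter: str) -> None:
        # The waiter is queued on the table of the node it was assigned to.
        self.tables[nid].setdefault(uaddr, []).append(waiter)

    def wake_with_nid(self, uaddr: int, nid: int) -> list:
        # The futex carries the nid, so a single table is inspected.
        self.lookups += 1
        return self.tables[nid].pop(uaddr, [])

    def wake_all_nodes(self, uaddr: int) -> list:
        # No nid stored with the futex: every node's table must be checked.
        woken = []
        for table in self.tables:
            self.lookups += 1
            woken.extend(table.pop(uaddr, []))
        return woken
```

In this model the second variant costs one table visit per node for every
wake, regardless of where the waiter actually sits.
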
> The whole numa-aware qspinlock saga is in part because of futex.
> 
> 
> That said; if we're going to do the whole futex-vector thing, we really
> do need a new interface, because the futex multiplex monster is about to
> crumble (see the fun wrt timeouts for example).
> 
> And if we're going to do a new interface, we ought to make one that can
> solve all these problems. Now, ideally glibc will bring forth some
> opinions, but if they don't want to play, we'll go back to the good old
> days of non-standard locking libraries.. we're halfway there already due
> to glibc not wanting to break with POSIX where we know POSIX was just
> dead wrong broken.
> 
> See: https://github.com/dvhart/librtpi
> 
>