Às 08:36 de 04/06/21, Nicholas Piggin escreveu: > Excerpts from André Almeida's message of June 4, 2021 5:59 am: >> Hi, >> >> This patch series introduces the futex2 syscalls. >> >> * What happened to the current futex()? >> >> For some years now, developers have been trying to add new features to >> futex, but maintainers have been reluctant to accept then, given the >> multiplexed interface full of legacy features and tricky to do big >> changes. Some problems that people tried to address with patchsets are: >> NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2]. >> NUMA, for instance, just doesn't fit the current API in a reasonable >> way. Considering that, it's not possible to merge new features into the >> current futex. >> >> ** The NUMA problem >> >> At the current implementation, all futex kernel side infrastructure is >> stored on a single node. Given that, all futex() calls issued by >> processors that aren't located on that node will have a memory access >> penalty when doing it. >> >> ** The 32bit sized futex problem >> >> Futexes are used to implement atomic operations in userspace. >> Supporting 8, 16, 32 and 64 bit sized futexes allows user libraries to >> implement all those sizes in a performant way. Thanks Boost devs for >> feedback: https://lists.boost.org/Archives/boost/2021/05/251508.php >> >> Embedded systems or anything with memory constrains could benefit of >> using smaller sizes for the futex userspace integer. >> >> ** The wait on multiple problem >> >> The use case lies in the Wine implementation of the Windows NT interface >> WaitMultipleObjects. This Windows API function allows a thread to sleep >> waiting on the first of a set of event sources (mutexes, timers, signal, >> console input, etc) to signal. Considering this is a primitive >> synchronization operation for Windows applications, being able to quickly >> signal events on the producer side, and quickly go to sleep on the >> consumer side is essential for good performance of those running over Wine. >> >> [0] https://lore.kernel.org/lkml/20160505204230.932454245@xxxxxxxxxxxxx/ >> [1] https://lore.kernel.org/lkml/20191221155659.3159-2-malteskarupke@xxxxxx/ >> [2] https://lore.kernel.org/lkml/20200213214525.183689-1-andrealmeid@xxxxxxxxxxxxx/ >> >> * The solution >> >> As proposed by Peter Zijlstra and Florian Weimer[3], a new interface >> is required to solve this, which must be designed with those features in >> mind. futex2() is that interface. As opposed to the current multiplexed >> interface, the new one should have one syscall per operation. This will >> allow the maintainability of the API if it gets extended, and will help >> users with type checking of arguments. >> >> In particular, the new interface is extended to support the ability to >> wait on any of a list of futexes at a time, which could be seen as a >> vectored extension of the FUTEX_WAIT semantics. >> >> [3] https://lore.kernel.org/lkml/20200303120050.GC2596@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ >> >> * The interface >> >> The new interface can be seen in details in the following patches, but >> this is a high level summary of what the interface can do: >> >> - Supports wake/wait semantics, as in futex() >> - Supports requeue operations, similarly as FUTEX_CMP_REQUEUE, but with >> individual flags for each address >> - Supports waiting for a vector of futexes, using a new syscall named >> futex_waitv() >> - Supports variable sized futexes (8bits, 16bits, 32bits and 64bits) >> - Supports NUMA-awareness operations, where the user can specify on >> which memory node would like to operate > > A few comments. > > - Do you need a syscall for wait and for waitv, or can wait just be a > degenerate case of waitv (which could use the stack variable)? I guess > it does save the user access unlock. > Yes. waitv with one element has a overhead compared to wait, given the extra user_copy(). Given that waiting on a single futex is the common case, I think it's worth to have it. > - Did you consider a wakev interface? An example is a read-write mutex > which has read-blocking futexes split (e.g., per-node) for scalability > then the writer may unlock and wake all readers with one call. We > actually have some scalability challenges of this nature with certain > large database programs. > Not really, I haven't heard any use case for that until now. It should be easy to implement it, though, and I think you have an interesting use case here. Could you point me some of those database programs? > - Great to see 64-bit support in, it is helpful to do some interesting > things with locks without hacks (e.g., putting an address in the lock > word). > > - Are we really keen on squashing node ID into flags in this day and age? > I guess okay but seems like it would be nice to allow a bit more space > in general for the operations. I don't want to turn it into a whole big > multiplexing nightmare again with lots of such flags, or propose > complexity with no code behind it, but I think we need a bit of leeway > for unforeseen locking innovations to be added carefully. The pthread > locking today is still fairly primitive really, I don't think we know > what will work best for the next 10 years. In the interface that I'd proposed, the node ID isn't part of the flags. You have a flag FUTEX_FLAG_NUMA, and when that is used, you pass in `void *uaddr` a pointer to a `struct futex_numa { int value, int hint }`, where hint should be the node ID you would like to work on, and value is just the userspace futex. This is documented in more details in patch 7 "docs: locking: futex2: Add documentation". If you have any feedback about how this NUMA interface looks like, I would like to hear. Also, did something in my writing indicated that the node ID would be part of the flags? I'll improve this it if so. > > One scalability issue we are starting to hit and will only get worse is > futex queue spinlock contention. Perhaps this is better addressed in > userspace but the kernel could play a part so I like to leave some doors > open. One example is that the wait (or wake) side may like to depend not > just on the memory value, but on the success of a cmpxchg to avoid > unqueueing and queueing spuriously, which increases lock contention but > also ends up putting the poor task on the back of the list -- yes RT > priorities can help the realtime case, but non-RT cases can get bad > outlier latencies if lock stealing is allowed (which can be very good > for performance). > Sorry, I'm not sure what do you mean here. Are you proposing to have a cmpxchg in kernel side, so the lock would be taken by the kernel, and not by the userspace like it's now? > - The private global futex hash table sucks for various reasons, and > having 128 waiters per thread makes it orders of magnitude easier for > userspace to DoS stuff with hash collisions. NUMA doesn't fix that, the > per process hashing that Thomas suggested does fix the DoS but the > non-deterministic hash collisions still seem to be a problem for real > time response, and at the other end of the scale some apps (certain > databases, etc) can have ten thousand futex waiters at once so birthday > paradox can also lead to guaranteed (low level) variable beahviour > within a single process. > > I know the kernel in general is not very robust against this kind of > DoS/nondeterminism, but it's a bit sad to introduce new APIs with the > problem still there. Yes we could address it later, but I think it's > better done first because the solution might influence what the best > syscall API is. > > For example the idea of using the address as the handle for the wait > queue _and_ the value is very convenient but hashing is annoying for > all the above reasons and the shared wait queue case is pretty clunky. > It's also constraining in some corner cases to have the wait queue > associated with the address 1:1. For example a type of NUMA mutex might > want to have per-node waitqueues associated with a lock word, and wake > local node waiters 99% of the time. Not trivial to do with futexes and > seems to at least require bouncing of more cache lines, possibly more > atomics, etc. > > Could anything else be done? I wasn't aware that userspace doing DoS is something to be concerned from the kernel point of view. Is this scenario considering a malicious actor? If so, there are plenty of resources to be denied, so not sure how futex could be protected of this. Or is this just a program that uses tons of futexes? > > I'll be burned at the stake for suggesting it but it would be great if > we could use file descriptors. At least for the shared futex, maybe > private could use a per-process futex allocator. It solves all of the > above, although I'm sure has many of its own problem. It may not play > so nicely with the pthread mutex API because of the whole static > initialiser problem, but the first futex proposal did use fds. But it's > an example of an alternate API. > FDs and futex doesn't play well, because for futex_wait() you need to tell the kernel the expected value in the futex address to avoid sleeping in a free lock. FD operations (poll, select) don't have this `value` argument, so they could sleep forever, but I'm not sure if you had taken this in consideration. > And.. thanks for the great work, apologies if I missed some discussion > related to the above comments, I did some reading but I know there has > been a lot of discussions going on. > :) and thank you for the valuable feedback and detailed ideas! > Thanks, > Nick >