Hi,

On 2021-06-01 15:58:25 +0100, Pavel Begunkov wrote:
> Should be interesting for a bunch of people, so we should first
> outline API and capabilities it should give. As I almost never
> had to deal with futexes myself, would especially love to hear
> use case, what might be lacking and other blind spots.

I did chat with Jens about how useful futex support would be in io_uring, so I
should outline our / my needs. I'm off work this week though, so I don't think
I'll have much time to experiment.

For postgres's AIO support (which I am working on) there are two, largely
independent, use cases for desiring futex support in io_uring.

The first is the ability to wait for locks (queued r/w locks, blocking
implemented via futexes) and IO at the same time, within one task. Quickly and
efficiently processing IO completions can improve whole-system latency and
throughput substantially in some cases (journalling, indexes and other
high-contention areas - which often have a low queue depth). This is true
*especially* when there also is lock contention, which tends to make efficient
IO scheduling harder.

The second use case is the ability to efficiently wait in several tasks for
one IO to be processed. The prototypical example here is group commit/journal
flush, where each task can only continue once the journal flush has
completed. Typically one of the waiters has to do a small amount of work with
the completion (updating a few shared memory variables) before the other
waiters can be released. It is hard to implement this efficiently and
race-free with io_uring right now without adding locking around *waiting* on
the completion side (instead of just consumption of completions). One cannot
just wait on the io_uring, because a) there is the obvious race that another
process could reap all completions between check and wait, and b) there is no
good way to wake up other waiters once the userspace portion of IO completion
is through.


All answers for postgres:

> 1) Do we need PI?

Not right now.
Not related to io_uring: I do wish there were a lower overhead (and lower
guarantees) version of PI futexes. Not for correctness reasons, but
performance. Granting the waiter's timeslice to the lock holder would improve
common contention scenarios with more runnable tasks than cores.

> 2) Do we need requeue? Anything else?

I can see requeue being useful, but I haven't thought it through fully.

Do the wake/wait ops as you have them right now support bitsets?

> 3) How hot waits are? May be done fully async avoiding io-wq, but
> apparently requires more changes in futex code.

The waits can be quite hot, most prominently on low latency storage, but not
only there.

Greetings,

Andres Freund
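
To make the second use case (group commit) a bit more concrete, here is a
minimal sketch of the wait/wake half using only the plain futex(2) syscall
between forked processes. It is not postgres code and it does not use any
io_uring futex operation; struct flush_state, wait_for_flush(),
complete_flush() and run_group_commit_demo() are hypothetical names made up
for illustration, and the "flush" is simulated rather than being a real IO.
The point it tries to show is that the futex value check closes the
check-then-wait race, and that one task can finish the userspace part of the
completion before releasing everybody else.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared-memory state for one journal flush. */
struct flush_state {
    _Atomic uint32_t flush_gen;    /* futex word, bumped after each processed flush */
    _Atomic uint64_t flushed_upto; /* stands in for "a few shared memory variables" */
};

/* Thin wrapper; glibc provides no futex() function. */
static long futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Waiter: sleep until the generation differs from the one we observed. */
static void wait_for_flush(struct flush_state *st, uint32_t seen_gen)
{
    while (atomic_load(&st->flush_gen) == seen_gen) {
        /* The kernel only sleeps if the word still equals seen_gen, so a
         * wakeup issued between our check and this call is not lost.
         * Errors such as EAGAIN/EINTR just lead to a recheck. */
        futex(&st->flush_gen, FUTEX_WAIT, seen_gen);
    }
}

/* Leader: do the userspace part of the completion, then release all waiters. */
static void complete_flush(struct flush_state *st, uint64_t upto)
{
    atomic_store(&st->flushed_upto, upto);
    atomic_fetch_add(&st->flush_gen, 1);
    futex(&st->flush_gen, FUTEX_WAKE, INT_MAX);
}

/* Forks nwaiters processes and returns how many saw the flushed position. */
int run_group_commit_demo(int nwaiters)
{
    struct flush_state *st = mmap(NULL, sizeof(*st), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(st != MAP_FAILED);
    uint32_t gen = atomic_load(&st->flush_gen); /* anonymous memory starts zeroed */

    for (int i = 0; i < nwaiters; i++) {
        pid_t pid = fork();
        assert(pid >= 0);
        if (pid == 0) {
            wait_for_flush(st, gen);
            _exit(atomic_load(&st->flushed_upto) == 42 ? 0 : 1);
        }
    }

    /* Stand-in for the leader having reaped the journal write's completion. */
    complete_flush(st, 42);

    int ok = 0;
    for (int i = 0; i < nwaiters; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    munmap(st, sizeof(*st));
    return ok;
}
```

What this sketch cannot do is the interesting part: the leader here has no way
to sleep on the futex word and on the io_uring completion queue at the same
time, which is what the first use case asks for.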