On 6/3/21 7:59 PM, Andres Freund wrote:
> Hi,
>
> On 2021-06-01 15:58:25 +0100, Pavel Begunkov wrote:
>> Should be interesting for a bunch of people, so we should first
>> outline API and capabilities it should give. As I almost never
>> had to deal with futexes myself, would especially love to hear
>> use case, what might be lacking and other blind spots.
>
> I did chat with Jens about how useful futex support would be in io_uring, so I
> should outline our / my needs. I'm off work this week though, so I don't think
> I'll have much time to experiment.
>
> For postgres's AIO support (which I am working on) there are two, largely
> independent, use-cases for desiring futex support in io_uring.
>
> The first is the ability to wait for locks (queued r/w locks, blocking
> implemented via futexes) and IO at the same time, within one task. Quickly and
> efficiently processing IO completions can improve whole-system latency and
> throughput substantially in some cases (journalling, indexes and other
> high-contention areas - which often have a low queue depth). This is true
> *especially* when there also is lock contention, which tends to make efficient
> IO scheduling harder.

Can you give a quick pointer to futex uses in the postgres code or
where they are? Can't find in master but want to see what types of
futex operations are used and how.

> The second use case is the ability to efficiently wait in several tasks for
> one IO to be processed. The prototypical example here is group commit/journal
> flush, where each task can only continue once the journal flush has
> completed. Typically one of waiters has to do a small amount of work with the
> completion (updating a few shared memory variables) before the other waiters
> can be released. It is hard to implement this efficiently and race-free with
> io_uring right now without adding locking around *waiting* on the completion
> side (instead of just consumption of completions). One cannot just wait on the
> io_uring, because of a) the obvious race that another process could reap all
> completions between check and wait b) there is no good way to wake up other
> waiters once the userspace portion of IO completion is through.

IIRC, the idea is to have a link "I/O -> fut_wake(master_task or nr=1)",
and then after getting some work done the woken task does wake(nr=all),
presumably via sys_futex or io_uring. Is that right?

As with this option userspace can't modify the memory on which futex
sits, the wake in the patchset allows to do an atomic add similarly to
FUTEX_WAKE_OP. However, I still have general concerns if that's a
flexible enough way.

When io_uring-BPF is added it can be offloaded to BPF programs
probably together with "updating a few shared memory variables",
but these are just thoughts for the future.

> All answers for postgres:
>
>> 1) Do we need PI?
>
> Not right now.
>
> Not related to io_uring: I do wish there were a lower overhead (and lower
> guarantees) version of PI futexes. Not for correctness reasons, but
> performance. Granting the waiter's timeslice to the lock holder would improve
> common contention scenarios with more runnable tasks than cores.
>
>
>> 2) Do we need requeue? Anything else?
>
> I can see requeue being useful, but I haven't thought it through fully.
>
> Do the wake/wait ops as you have them right now support bitsets?

No, but trivial to add

>> 3) How hot waits are? May be done fully async avoiding io-wq, but
>> apparently requires more changes in futex code.
>
> The waits can be quite hot, most prominently on low latency storage, but not
> just.

Thanks Andres, that clears it up. The next step would be to verify
that FUTEX_WAKE_OP-style waking is enough.

-- 
Pavel Begunkov
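
For reference, a minimal single-threaded sketch of the FUTEX_WAKE_OP-style
semantics mentioned above, using the plain futex(2) syscall rather than the
io_uring interface from the patchset. The names futex(), wake_and_add(),
demo_wake_op(), gate and seq are hypothetical and chosen for illustration only.
It shows two things: the wake can atomically add to a second word as part of
the same call, and a later FUTEX_WAIT with a stale snapshot of that word fails
with EAGAIN instead of sleeping, which is what closes the check-then-wait race.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no futex() wrapper, so go through syscall(2).
 * For FUTEX_WAKE_OP the "timeout" slot carries the second wake count. */
static long futex(uint32_t *uaddr, int op, uint32_t val,
                  uintptr_t val2, uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, op, val, val2, uaddr2, val3);
}

/* Hypothetical helper: wake up to nr_wake waiters on *gate and, in the
 * same call, atomically add delta to *seq. The second wake count is 0,
 * so the comparison encoded in the op word has no visible effect. */
static long wake_and_add(uint32_t *gate, uint32_t *seq, int nr_wake, int delta)
{
    uint32_t op = FUTEX_OP(FUTEX_OP_ADD, delta, FUTEX_OP_CMP_EQ, 0);

    return futex(gate, FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG,
                 (uint32_t)nr_wake, 0, seq, op);
}

/* Returns 0 if the kernel behaved as described, a negative code otherwise. */
int demo_wake_op(void)
{
    uint32_t gate = 0;   /* word the leader would sleep on */
    uint32_t seq = 0;    /* completion counter updated by the wake */
    uint32_t seen = seq; /* a waiter's snapshot, taken before the wake */

    /* Nobody is sleeping on gate, so 0 tasks are woken, but the atomic
     * add on seq still happens. */
    long woken = wake_and_add(&gate, &seq, 1, 1);
    if (woken != 0 || seq != 1)
        return -1;

    /* A waiter arriving late with the stale snapshot must not sleep:
     * the kernel compares *seq against 'seen' and refuses with EAGAIN. */
    long rc = futex(&seq, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, seen, 0, NULL, 0);
    if (rc != -1 || errno != EAGAIN)
        return -2;

    return 0;
}
```

Calling demo_wake_op() from a small main() and checking for a 0 return is
enough to try it on Linux. In the group-commit case the leader would be the
nr=1 wake target, and after updating shared state it would issue a second
wake with a large count for the remaining waiters.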