On 12/20/22 12:12 PM, Pavel Begunkov wrote:
> On 12/20/22 18:10, Jens Axboe wrote:
>> On 12/20/22 11:06 AM, Pavel Begunkov wrote:
>>> On 12/20/22 17:58, Pavel Begunkov wrote:
>>>> NOT FOR INCLUSION, needs some ring poll workarounds
>>>>
>>>> Flush completions is done either from the submit syscall or by the
>>>> task_work, both are in the context of the submitter task, and when it
>>>> goes for a single threaded rings like implied by ->task_complete, there
>>>> won't be any waiters on ->cq_wait but the master task. That means that
>>>> there can be no tasks sleeping on cq_wait while we run
>>>> __io_submit_flush_completions() and so waking up can be skipped.
>>>
>>> Not trivial to benchmark as we need something to emulate a task_work
>>> coming in the middle of waiting. I used the diff below to complete nops
>>> in tw and removed preliminary tw runs for the "in the middle of waiting"
>>> part. IORING_SETUP_SKIP_CQWAKE controls whether we use optimisation or
>>> not.
>>>
>>> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
>>> to 10% of wakeup cost in profiles. Another interesting part is that
>>> waitqueues are excessive for our purposes and we can replace cq_wait
>>> with something less heavier, e.g. atomic bit set
>>
>> I was thinking something like that the other day, for most purposes
>> the wait infra is too heavy handed for our case. If we exclude poll
>> for a second, everything else is internal and eg doesn't need IRQ
>> safe locking at all. That's just one part of it. But I didn't have
>
> Ring polling? We can move it to a separate waitqueue, probably with
> some tricks to remove extra ifs from the hot path, which I'm
> planning to add in v2.

Yes, polling on the ring itself. And that was my thinking too, leave
cq_wait just for that and then hide it behind <something something> to
make it hopefully almost free for when the ring isn't polled. I just
hadn't put any thought into what exactly that'd look like just yet.

>> a good idea for the poll() side of things, which would be required
>> to make some progress there.
>
> I'll play with replacing waitqueues with a bitops, should save some
> extra ~5% with the benchmark I used.

Excellent, looking forward to seeing that.

-- 
Jens Axboe