On 6/5/24 1:13 PM, Pavel Begunkov wrote:
> On 6/5/24 17:31, Pavel Begunkov wrote:
>> On 6/5/24 16:11, Pavel Begunkov wrote:
>>> On 6/4/24 20:01, Jens Axboe wrote:
>>>> io_uring currently uses percpu refcounts for the ring reference. This
>>>> works fine, but exiting a ring requires an RCU grace period to lapse
>>>> and this slows down ring exit quite a lot.
>>>>
>>>> Add a basic per-cpu counter for our references instead, and use that.
>>>
>>> All the synchronisation heavy lifting is done by RCU, so what
>>> makes it safe to read other CPUs' counters in
>>> io_ring_ref_maybe_done()?
>>
>> Other options are expedited RCU (Paul says it's an order of
>> magnitude faster), or switching to plain atomics since it's cached,
>> but that's only good if the submitter and waiter are the same task. Paul
>
> I mixed it up with task refs; ctx refs should be cached well
> for any configuration, as they're bound to requests (and req
> caches).

That's a good point, maybe even our current RCU approach is overkill
since we do the caching pretty well. Let me run a quick test, just
switching this to a basic atomic_t. The dead mask can just be the
31st bit.

-- 
Jens Axboe
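For reference, a minimal userspace sketch of the scheme being described: a single atomic reference count with bit 31 reserved as the dead mask. The names (ring_ref_get/put/kill) and the C11 atomics are illustrative assumptions, not the actual io_uring patch, which would use the kernel's atomic_t API instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit 31 marks the ring as dead; the low 31 bits are the refcount. */
#define RING_REF_DEAD	(1U << 31)

struct ring_ref {
	atomic_uint refs;
};

static void ring_ref_init(struct ring_ref *r)
{
	/* Start with one reference held by the ring itself. */
	atomic_init(&r->refs, 1);
}

static void ring_ref_get(struct ring_ref *r)
{
	atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true if this put dropped the last reference of a dead ring. */
static bool ring_ref_put(struct ring_ref *r)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(&r->refs, 1, memory_order_acq_rel);
	return old == (RING_REF_DEAD | 1);
}

/* Mark the ring dead and drop the initial reference. */
static bool ring_ref_kill(struct ring_ref *r)
{
	atomic_fetch_or_explicit(&r->refs, RING_REF_DEAD,
				 memory_order_acq_rel);
	return ring_ref_put(r);
}
```

Because gets and puts are bound to requests (and hence hit the request caches), a single shared atomic may be cheap enough here, which is the point being tested above.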