On 6/5/24 1:29 PM, Jens Axboe wrote:
> On 6/5/24 1:13 PM, Pavel Begunkov wrote:
>> On 6/5/24 17:31, Pavel Begunkov wrote:
>>> On 6/5/24 16:11, Pavel Begunkov wrote:
>>>> On 6/4/24 20:01, Jens Axboe wrote:
>>>>> io_uring currently uses percpu refcounts for the ring reference. This
>>>>> works fine, but exiting a ring requires an RCU grace period to lapse
>>>>> and this slows down ring exit quite a lot.
>>>>>
>>>>> Add a basic per-cpu counter for our references instead, and use that.
>>>>
>>>> All the synchronisation heavy lifting is done by RCU, what
>>>> makes it safe to read other CPUs counters in
>>>> io_ring_ref_maybe_done()?
>>>
>>> Other options are expedited RCU (Paul saying it's an order of
>>> magnitude faster), or to switch to plain atomics since it's cached,
>>> but it's only good if submitter and waiter are the same task. Paul
>>
>> I mixed it with task refs, ctx refs should be cached well
>> for any configuration as they're bound to requests (and req
>> caches).
>
> That's a good point, maybe even our current RCU approach is overkill
> since we do the caching pretty well. Let me run a quick test, just
> switching this to a basic atomic_t. The dead mask can just be the 31st
> bit.

Well, the exception is non-local task_work, where we still grab and put
a reference on the ctx for each context while iterating. Outside of
that, the request pre-alloc takes care of the rest.

-- 
Jens Axboe
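
For reference, a minimal sketch of what the "basic atomic_t" scheme
described above might look like, with bit 31 as the dead mask. The
struct and helper names (io_ring_ref, io_ring_ref_get/put/kill) are
illustrative assumptions, not the actual patch or io_uring's API:

/*
 * Sketch only: the low 31 bits count references, bit 31 marks the
 * ring as dying. Names are hypothetical.
 */
#include <linux/atomic.h>

#define IO_RING_REF_DEAD	(1U << 31)

struct io_ring_ref {
	atomic_t refs;	/* initialised to 1 (atomic_set) for the ring itself */
};

static inline void io_ring_ref_get(struct io_ring_ref *r)
{
	atomic_inc(&r->refs);
}

/* Returns true when a killed ring drops its final reference */
static inline bool io_ring_ref_put(struct io_ring_ref *r)
{
	/* only the DEAD bit left means: dying, and no references remain */
	return (unsigned int)atomic_dec_return(&r->refs) == IO_RING_REF_DEAD;
}

static inline bool io_ring_ref_kill(struct io_ring_ref *r)
{
	/* mark the ring as dying, then drop the initial reference */
	atomic_or((int)IO_RING_REF_DEAD, &r->refs);
	return io_ring_ref_put(r);
}

Because ctx references are bound to requests, which are pre-allocated
and cached per ring (as Pavel notes above), the atomic would rarely
bounce between CPUs, so this avoids the RCU grace period on ring exit
without turning the refcount into a cross-CPU hot spot.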