Jens Axboe <axboe@xxxxxxxxx> writes:

> On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
>> Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
>>
>>> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>>>> Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
>>>>
>>>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>>>> much, definitely doesn't spare from locking, and the overhead
>>>>> definitely wasn't satisfactory for requests before.
>>>>
>>>> There are no locks in the fast path of slub, as far as I know. It
>>>> has a per-cpu cache that is refilled once empty, quite similar to
>>>> the fast path of this cache. I imagine the performance hit in slub
>>>> comes from the barrier and atomic operations?
>>>
>>> Yeah, I mean all kinds of synchronisation. And I don't think
>>> that's the main offender here, the test is single threaded without
>>> contention and the system was mostly idle.
>>>
>>>> kmem_cache works fine for most hot paths of the kernel. I think this
>>>
>>> It doesn't for io_uring. There are caches for the net side and now
>>> in the block layer as well. I wouldn't say it necessarily halves
>>> performance but definitely takes a share of CPU.
>>
>> Right. My point is that all these caches (block, io_uring) duplicate
>> what the slab cache is meant to do. Since slab became a bottleneck,
>> I'm looking at how to improve the situation on their side, to see if
>> we can drop the caching here and in block/.
>
> That would certainly be a worthy goal, and I do agree that these caches
> are (largely) working around deficiencies. One important point that you
> may miss is that most of this caching gets its performance from both
> avoiding atomics in slub, but also because we can guarantee that both
> alloc and free happen from process context. The block IRQ bits are a
> bit different, but apart from that, it's true elsewhere. Caching that
> needs to even disable IRQs locally generally doesn't beat out slub by
> much, the big wins are the cases where we know free+alloc is done in
> process context.

Yes, I noticed that. I was thinking of exposing a flag at kmem_cache
creation time to tell slab that the user promises not to use it from
IRQ context, so it doesn't need to worry about nested invocation in the
allocation/free paths. Then, for those caches, provide a
kmem_cache_alloc_locked variant where synchronization is maintained by
the caller (i.e. by ->uring_lock here), so it can manipulate the cache
without atomics.

I was looking at your implementation of the block cache for inspiration
and saw how you kept a second list for IRQ. I'm thinking about how to
fit a similar change into slub, but for now I want to handle the
simpler case, which is all io_uring needs. I'll try to get a prototype
together before LSFMM and see if I can get the MM folks' input there.

-- 
Gabriel Krisman Bertazi
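
[A rough sketch of the interface proposed above, just to make the shape
concrete. SLAB_NO_IRQ, kmem_cache_alloc_locked() and
kmem_cache_free_locked() are hypothetical names that do not exist in
the current slab API (only kmem_cache_alloc_locked is actually named in
the mail); struct io_kiocb and ->uring_lock are used purely for
illustration of how io_uring might consume it.]

	struct kmem_cache *req_cache;

	/*
	 * Ring setup: the hypothetical SLAB_NO_IRQ flag promises that
	 * this cache is never used from IRQ context, so the fast path
	 * does not have to guard against nested invocation.
	 */
	static int io_req_cache_init(void)
	{
		req_cache = kmem_cache_create("io_uring_req",
					      sizeof(struct io_kiocb),
					      0, SLAB_NO_IRQ, NULL);
		return req_cache ? 0 : -ENOMEM;
	}

	/*
	 * Fast path: the caller holds ->uring_lock, which serializes
	 * every alloc/free against this cache, so the hypothetical
	 * *_locked() variants can touch the per-cpu slab state without
	 * atomics or IRQ disabling.
	 */
	static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
	{
		lockdep_assert_held(&ctx->uring_lock);
		return kmem_cache_alloc_locked(req_cache, GFP_KERNEL);
	}

	static void io_free_req(struct io_ring_ctx *ctx, struct io_kiocb *req)
	{
		lockdep_assert_held(&ctx->uring_lock);
		kmem_cache_free_locked(req_cache, req);
	}

[The point of the sketch is only that, with the caller's lock doing the
serialization, the slab fast path could in principle skip its usual
atomics; how slub would actually accommodate that internally is exactly
the open question raised above.]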