Jens Axboe <axboe@xxxxxxxxx> writes:

> On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
>> Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
>>
>>> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>>>> Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
>>>>
>>>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>>>> much, definitely doesn't spare from locking, and the overhead
>>>>> definitely wasn't satisfactory for requests before.
>>>>
>>>> There are no locks in the fast path of slub, as far as I know. It
>>>> has a per-cpu cache that is refilled once empty, quite similar to
>>>> the fast path of this cache. I imagine the performance hit in slub
>>>> comes from the barrier and atomic operations?
>>>
>>> Yeah, I mean all kinds of synchronisation. And I don't think
>>> that's the main offender here, the test is single threaded without
>>> contention and the system was mostly idle.
>>>
>>>> kmem_cache works fine for most hot paths of the kernel. I think this
>>>
>>> It doesn't for io_uring. There are caches for the net side and now
>>> in the block layer as well. I wouldn't say it necessarily halves
>>> performance but definitely takes a share of CPU.
>>
>> Right. My point is that all these caches (block, io_uring) duplicate
>> what the slab cache is meant to do. Since slab became a bottleneck,
>> I'm looking at how to improve the situation on their side, to see if
>> we can drop the caching here and in block/.
>
> That would certainly be a worthy goal, and I do agree that these caches
> are (largely) working around deficiencies. One important point that you
> may miss is that most of this caching gets its performance from both
> avoiding atomics in slub, but also because we can guarantee that both
> alloc and free happen from process context. The block IRQ bits are a
> bit different, but apart from that, it's true elsewhere. Caching that
> needs to even disable IRQs locally generally doesn't beat out slub by
> much, the big wins are the cases where we know free+alloc is done in
> process context.

Yes, I noticed that. I was thinking of exposing a flag at kmem_cache
creation time to tell slab that the user promises not to use it from
IRQ context, so it doesn't need to worry about nested invocation in the
allocation/free paths. Then, for those caches, provide a
kmem_cache_alloc_locked variant where synchronization is maintained by
the caller (i.e. by ->uring_lock here), so it can manipulate the cache
without atomics.

I was looking at your implementation of the block cache for inspiration
and saw how you kept a second list for IRQ. I'm thinking about how to
fit a similar change into slub, but for now I want to handle the
simpler case, which is all io_uring needs. I'll try to get a prototype
together before LSFMM and see if I can get the MM folks' input there.

-- 
Gabriel Krisman Bertazi
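
[A rough sketch of the interface proposed above, just to make the shape
concrete. SLAB_NO_IRQ, kmem_cache_alloc_locked() and
kmem_cache_free_locked() are hypothetical names that do not exist in
the current slab API (only kmem_cache_alloc_locked is actually named in
the mail); struct io_kiocb and ->uring_lock are used purely for
illustration of how io_uring might consume it.]

	struct kmem_cache *req_cache;

	/*
	 * Ring setup: the hypothetical SLAB_NO_IRQ flag promises that
	 * this cache is never used from IRQ context, so the fast path
	 * does not have to guard against nested invocation.
	 */
	static int io_req_cache_init(void)
	{
		req_cache = kmem_cache_create("io_uring_req",
					      sizeof(struct io_kiocb),
					      0, SLAB_NO_IRQ, NULL);
		return req_cache ? 0 : -ENOMEM;
	}

	/*
	 * Fast path: the caller holds ->uring_lock, which serializes
	 * every alloc/free against this cache, so the hypothetical
	 * *_locked() variants can touch the per-cpu slab state without
	 * atomics or IRQ disabling.
	 */
	static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
	{
		lockdep_assert_held(&ctx->uring_lock);
		return kmem_cache_alloc_locked(req_cache, GFP_KERNEL);
	}

	static void io_free_req(struct io_ring_ctx *ctx, struct io_kiocb *req)
	{
		lockdep_assert_held(&ctx->uring_lock);
		kmem_cache_free_locked(req_cache, req);
	}

[The point of the sketch is only that, with the caller's lock doing the
serialization, the slab fast path could in principle skip its usual
atomics; how slub would actually accommodate that internally is exactly
the open question raised above.]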