On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
>
>> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>>> Pavel Begunkov <asml.silence@xxxxxxxxx> writes:
>
>>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>>> much, definitely doesn't spare from locking, and the overhead
>>>> definitely wasn't satisfactory for requests before.
>>> There is no locks in the fast path of slub, as far as I know. it has a
>>> per-cpu cache that is refilled once empty, quite similar to the fastpath
>>> of this cache. I imagine the performance hit in slub comes from the
>>> barrier and atomic operations?
>>
>> Yeah, I mean all kinds of synchronisation. And I don't think
>> that's the main offender here, the test is single threaded without
>> contention and the system was mostly idle.
>>
>>> kmem_cache works fine for most hot paths of the kernel. I think this
>>
>> It doesn't for io_uring. There are caches for the net side and now
>> in the block layer as well. I wouldn't say it necessarily halves
>> performance but definitely takes a share of CPU.
>
> Right. My point is that all these caches (block, io_uring) duplicate
> what the slab cache is meant to do. Since slab became a bottleneck, I'm
> looking at how to improve the situation on their side, to see if we can
> drop the caching here and in block/.

That would certainly be a worthy goal, and I do agree that these caches
are (largely) working around deficiencies. One important point that you
may miss is that most of this caching gets its performance from both
avoiding atomics in slub, but also because we can guarantee that both
alloc and free happen from process context. The block IRQ bits are a bit
different, but apart from that, it's true elsewhere. Caching that needs
to even disable IRQs locally generally doesn't beat out slub by much,
the big wins are the cases where we know free+alloc is done in process
context.

-- 
Jens Axboe
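
For illustration only, here is a rough userspace sketch of the kind of
cache being discussed: a free list that is only ever touched from one
context, so neither alloc nor free needs locks, atomics, or IRQ
disabling. All names (obj_cache, obj_cache_alloc, and so on) are
hypothetical and this is not the io_uring or block layer code;
malloc()/free() merely stand in for the slab allocator as the slow path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: a free-list cache used from a single context only. */
struct cache_entry {
	struct cache_entry *next;
};

struct obj_cache {
	struct cache_entry *head;	/* singly linked list of free objects */
	unsigned int nr_cached;		/* objects currently on the list */
	unsigned int max_cached;	/* cap, so the cache cannot grow unbounded */
	size_t obj_size;
};

static void obj_cache_init(struct obj_cache *c, size_t obj_size,
			   unsigned int max_cached)
{
	c->head = NULL;
	c->nr_cached = 0;
	c->max_cached = max_cached;
	/* A free object must be large enough to hold the list linkage. */
	c->obj_size = obj_size < sizeof(struct cache_entry) ?
		      sizeof(struct cache_entry) : obj_size;
}

/* Fast path: pop from the list with plain loads and stores. */
static void *obj_cache_alloc(struct obj_cache *c)
{
	struct cache_entry *e = c->head;

	if (e) {
		c->head = e->next;
		c->nr_cached--;
		return e;
	}
	/* Slow path: fall back to the general purpose allocator. */
	return malloc(c->obj_size);
}

/* Fast path: push onto the list, again without any synchronisation. */
static void obj_cache_free(struct obj_cache *c, void *obj)
{
	struct cache_entry *e = obj;

	assert(obj != NULL);
	if (c->nr_cached < c->max_cached) {
		e->next = c->head;
		c->head = e;
		c->nr_cached++;
		return;
	}
	/* Cache is full: hand the object back to the allocator. */
	free(obj);
}

/* Release everything still held by the cache. */
static void obj_cache_destroy(struct obj_cache *c)
{
	while (c->head) {
		struct cache_entry *e = c->head;

		c->head = e->next;
		free(e);
	}
	c->nr_cached = 0;
}
```

The plain loads and stores are only safe because of the assumption that
alloc and free always run in the same context. If free could also be
called from interrupt context, the list manipulation would need at least
local IRQ disabling around it.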