On 5/13/20 1:20 PM, Pekka Enberg wrote:
>
> Hi,
>
> On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> I turned the quick'n dirty from the other day into something a bit
>>> more done. Would be great if someone else could run some
>>> performance testing with this, I get about a 10% boost on the pure
>>> NOP benchmark with this. But that's just on my laptop in qemu, so
>>> some real iron testing would be awesome.
>
> On 5/13/20 8:42 PM, Jann Horn wrote:> +slab allocator people
>> 10% boost compared to which allocator? Are you using CONFIG_SLUB?
>
> On Wed, May 13, 2020 at 6:30 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> The idea here is to have a percpu alloc cache. There's two sets of
>>> state:
>>>
>>> 1) Requests that have IRQ completion. preempt disable is not
>>> enough there, we need to disable local irqs. This is a lot slower
>>> in certain setups, so we keep this separate.
>>>
>>> 2) No IRQ completion, we can get by with just disabling preempt.
>
> On 5/13/20 8:42 PM, Jann Horn wrote:> +slab allocator people
>> The SLUB allocator has percpu caching, too, and as long as you don't
>> enable any SLUB debugging or ASAN or such, and you're not hitting
>> any slowpath processing, it doesn't even have to disable interrupts,
>> it gets away with cmpxchg_double.
>
> The struct io_kiocb is 240 bytes. I don't see a dedicated slab for it in
> /proc/slabinfo on my machine, so it likely got merged to the kmalloc-256
> cache. This means that there's 32 objects in the per-CPU cache. Jens, on
> the other hand, made the cache much bigger:

Right, it gets merged with kmalloc-256 (and 5 others) in my testing.

> +#define IO_KIOCB_CACHE_MAX 256
>
> So I assume if someone does "perf record", they will see significant
> reduction in page allocator activity with Jens' patch. One possible way
> around that is forcing the page allocation order to be much higher. IOW,
> something like the following completely untested patch:

Now tested, I gave it a shot.
This seems to bring performance to basically what the io_uring patch
does, so that's great! Again, just in the microbenchmark test case, so
freshly booted and just running the case.

Will this patch introduce latencies or non-deterministic behavior for a
fragmented system?

-- 
Jens Axboe