On 5/10/20 9:30 PM, Xiaoguang Wang wrote: > hi, > >> On 5/8/20 9:18 AM, Xiaoguang Wang wrote: >>> hi, >>> >>> This issue was found when I tested IORING_FEAT_FAST_POLL feature, with >>> the newest upstream codes, indeed I find that io_uring's performace >>> improvement is not obvious compared to epoll in my test environment, >>> most of the time they are similar. Test cases basically comes from: >>> https://github.com/frevib/io_uring-echo-server/blob/io-uring-feat-fast-poll/benchmarks/benchmarks.md. >>> In above url, the author's test results shows that io_uring will get a >>> big performace improvement compared to epoll. I'm still looking into >>> why I don't get the big improvement, currently don't know why, but I >>> find some obvious regression issue. >>> >>> I wrote a simple tool based io_uring nop operation to evaluate >>> io_uring framework in v5.1 and 5.7.0-rc4+(jens's io_uring-5.7 branch), >>> I see a obvious performace regression: >>> >>> v5.1 kernel: >>> $sudo taskset -c 60 ./io_uring_nop_stress -r 300 # run 300 seconds >>> total ios: 1832524960 >>> IOPS: 6108416 >>> 5.7.0-rc4+ >>> $sudo taskset -c 60 ./io_uring_nop_stress -r 300 >>> total ios: 1597672304 >>> IOPS: 5325574 >>> it's about 12% performance regression. >> >> For sure there's a bit more bloat in 5.7+ compared to the initial slim >> version, and focus has been on features to a certain extent recently. >> The poll rework for 5.7 will really improve performance for the >> networked side though, so it's not like it's just piling on features >> that add bloat. >> >> That said, I do think it's time for a revisit on overhead. It's been a >> while since I've done my disk IO testing. The nop testing isn't _that_ >> interesting by itself, as a micro benchmark it does yield some results >> though. Are you running on bare metal or in a VM? > I run it on bare metal. > >> >>> Using perf can see many performance bottlenecks, for example, >>> io_submit_sqes is one. For now, I did't make many analysis yet, just >>> have a look at io_submit_sqes(), there are many assignment operations >>> in io_init_req(), but I'm not sure whether they are all needed when >>> req is not needed to be punt to io-wq, for example, >>> INIT_IO_WORK(&req->work, io_wq_submit_work); # a whole struct >>> assignment from perf annotate tool, it's an expensive operation, I >>> think reqs that use fast poll feature use task-work function, so the >>> INIT_IO_WORK maybe not necessary. >> >> I'm sure there's some low hanging fruit there, and I'd love to take >> patches for it. > Ok, I'll try to improve it a bit, thanks. > >> >>> Above is just one issue, what I worry is that whether io_uring is >>> becoming more bloated gradually, and will not that better to aio. In >>> https://kernel.dk/io_uring.pdf, it says that io_uring will eliminate >>> 104 bytes copy compared to aio, but see currenct io_init_req(), >>> io_uring maybe copy more, introducing more overhead? Or does we need >>> to carefully re-design struct io_kiocb, to reduce overhead as soon as >>> possible. >> >> The copy refers to the data structures coming in and out, both io_uring >> and io_uring inititalize their main io_kiocb/aio_kiocb structure as >> well. The io_uring is slightly bigger, but not much, and it's the same >> number of cachelines. So should not be a huge difference there. The >> copying on the aio side is basically first the pointer copy, then the >> user side kiocb structure. io_uring doesn't need to do that. The >> completion side is also slimmer. We also don't need as many system calls >> to do the same thing, for example. >> >> So no, we should always been substantially slimmer than aio, just by the >> very nature of the API. > Really thanks for detailed explanations, now I see, feel good:) > >> >> One major thing I've been thinking about for io_uring is io_kiocb >> recycling. We're hitting the memory allocator for alloc+free for each >> request, even though that can be somewhat amortized by doing batched >> submissions, and polling for instance can also do batched frees. But I'm >> pretty sure we can find some gains here by having some io_kiocb caching >> that is persistent across operations. > Yes, agree. From my above nop operation tests, using perf, it shows > kmem_cache_free() and kmem_cache_alloc_bulk() introduce an considerable > overhead. Just did a quick'n dirty where we have a global per-cpu cache of io_kiocbs, so we can recycle them. I'm using my laptop for this testing, running in kvm. This isn't ideal, as the spinlock/irq overhead is much larger there, but it's the same across all runs. Baseline performance, this is running your test app as-is, no changes: # taskset -c 0 ./io_uring_nop_stress total ios: 294158288 IOPS: 9805276 Then I disabled the bulk alloc optimization, so we don't get any benefits of allocating 16 requests at the time: # taskset -c 0 ./io_uring_nop_stress total ios: 285341456 IOPS: 9511381 As expected, we're down a bit at that point, as we're allocating each io_kiocb individually. Then I tried with the quick'n dirty patch: # taskset -c 0 ./io_uring_nop_stress total ios: 325797040 IOPS: 10859901 and it's quite a bit faster. This is cheating a little bit, since these are just nops and we don't have to disable local irqs while grabbing a recycled entry. But just adding that and it should be safe to use, and probably won't make much of a difference on actual hardware. The main concern for me is how to manage this cache. We could just limit it to X entries per cpu or something like that, but it'd be nice to be able to respond to memory pressure. Quick'n dirty below. diff --git a/fs/io_uring.c b/fs/io_uring.c index 4c2fe06ae20b..b6a6d0f0f4e5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -673,6 +673,13 @@ struct io_submit_state { unsigned int ios_left; }; +struct io_kiocb_cache { + struct list_head list; + unsigned nr_free; +}; + +static struct io_kiocb_cache *alloc_cache; + struct io_op_def { /* needs req->io allocated for deferral/async */ unsigned async_ctx : 1; @@ -1293,6 +1300,29 @@ static struct io_kiocb *io_get_fallback_req(struct io_ring_ctx *ctx) return NULL; } +static struct io_kiocb *io_cache_alloc(void) +{ + struct io_kiocb_cache *cache; + struct io_kiocb *req = NULL; + + cache = this_cpu_ptr(alloc_cache); + if (!list_empty(&cache->list)) { + req = list_first_entry(&cache->list, struct io_kiocb, list); + list_del(&req->list); + return req; + } + + return NULL; +} + +static void io_cache_free(struct io_kiocb *req) +{ + struct io_kiocb_cache *cache; + + cache = this_cpu_ptr(alloc_cache); + list_add(&req->list, &cache->list); +} + static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx, struct io_submit_state *state) { @@ -1300,6 +1330,9 @@ static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx, struct io_kiocb *req; if (!state) { + req = io_cache_alloc(); + if (req) + return req; req = kmem_cache_alloc(req_cachep, gfp); if (unlikely(!req)) goto fallback; @@ -1372,7 +1405,7 @@ static void __io_free_req(struct io_kiocb *req) percpu_ref_put(&req->ctx->refs); if (likely(!io_is_fallback_req(req))) - kmem_cache_free(req_cachep, req); + io_cache_free(req); else clear_bit_unlock(0, (unsigned long *) &req->ctx->fallback_req); } @@ -5842,7 +5875,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, struct file *ring_file, int ring_fd, bool async) { - struct io_submit_state state, *statep = NULL; + //struct io_submit_state state, *statep = NULL; struct io_kiocb *link = NULL; int i, submitted = 0; @@ -5859,10 +5892,12 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, if (!percpu_ref_tryget_many(&ctx->refs, nr)) return -EAGAIN; +#if 0 if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, nr); statep = &state; } +#endif ctx->ring_fd = ring_fd; ctx->ring_file = ring_file; @@ -5877,14 +5912,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, io_consume_sqe(ctx); break; } - req = io_alloc_req(ctx, statep); + req = io_alloc_req(ctx, NULL); if (unlikely(!req)) { if (!submitted) submitted = -EAGAIN; break; } - err = io_init_req(ctx, req, sqe, statep, async); + err = io_init_req(ctx, req, sqe, NULL, async); io_consume_sqe(ctx); /* will complete beyond this point, count as submitted */ submitted++; @@ -5898,7 +5933,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, trace_io_uring_submit_sqe(ctx, req->opcode, req->user_data, true, async); - err = io_submit_sqe(req, sqe, statep, &link); + err = io_submit_sqe(req, sqe, NULL, &link); if (err) goto fail_req; } @@ -5910,8 +5945,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, } if (link) io_queue_link_head(link); +#if 0 if (statep) io_submit_state_end(&state); +#endif /* Commit SQ ring head once we've consumed and submitted all SQEs */ io_commit_sqring(ctx); @@ -8108,6 +8145,8 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, static int __init io_uring_init(void) { + int cpu; + #define __BUILD_BUG_VERIFY_ELEMENT(stype, eoffset, etype, ename) do { \ BUILD_BUG_ON(offsetof(stype, ename) != eoffset); \ BUILD_BUG_ON(sizeof(etype) != sizeof_field(stype, ename)); \ @@ -8147,6 +8186,15 @@ static int __init io_uring_init(void) BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int)); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + + alloc_cache = alloc_percpu(struct io_kiocb_cache); + for_each_possible_cpu(cpu) { + struct io_kiocb_cache *cache; + + cache = per_cpu_ptr(alloc_cache, cpu); + INIT_LIST_HEAD(&cache->list); + cache->nr_free = 0; + } return 0; }; __initcall(io_uring_init); -- Jens Axboe