Re: [PATCH 4/4] io_uring: mark opcodes that always need io-wq punt

On 4/25/23 16:07, Ming Lei wrote:
On Tue, Apr 25, 2023 at 08:50:33AM -0600, Jens Axboe wrote:
On 4/25/23 8:42 AM, Ming Lei wrote:
On Tue, Apr 25, 2023 at 07:31:10AM -0600, Jens Axboe wrote:
On 4/24/23 8:50 PM, Ming Lei wrote:
On Mon, Apr 24, 2023 at 08:18:02PM -0600, Jens Axboe wrote:
On 4/24/23 8:13 PM, Ming Lei wrote:
On Mon, Apr 24, 2023 at 08:08:09PM -0600, Jens Axboe wrote:
On 4/24/23 6:57 PM, Ming Lei wrote:
On Mon, Apr 24, 2023 at 09:24:33AM -0600, Jens Axboe wrote:
On 4/24/23 1:30 AM, Ming Lei wrote:
On Thu, Apr 20, 2023 at 12:31:35PM -0600, Jens Axboe wrote:
Add an opdef bit for them, and set it for the opcodes where we always
need io-wq punt. With that done, exclude them from the file_can_poll()
check in terms of whether or not we need to punt them if any of the
NO_OFFLOAD flags are set.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
  io_uring/io_uring.c |  2 +-
  io_uring/opdef.c    | 22 ++++++++++++++++++++--
  io_uring/opdef.h    |  2 ++
  3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index fee3e461e149..420cfd35ebc6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1948,7 +1948,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
  		return -EBADF;
 	if (issue_flags & IO_URING_F_NO_OFFLOAD &&
-	    (!req->file || !file_can_poll(req->file)))
+	    (!req->file || !file_can_poll(req->file) || def->always_iowq))
  		issue_flags &= ~IO_URING_F_NONBLOCK;
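
For readers following the condition in the hunk above, here is a simplified user-space model of the patched check. It is not kernel code: the MODEL_F_* bits, struct model_file, struct model_def and model_issue_flags() are hypothetical stand-ins for the IO_URING_F_* flags, req->file (with its file_can_poll() result) and the opcode definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's IO_URING_F_* values differ. */
#define MODEL_F_NONBLOCK   (1u << 0)
#define MODEL_F_NO_OFFLOAD (1u << 1)

/* Stand-ins for req->file and the per-opcode definition. */
struct model_file { bool pollable; };      /* file_can_poll() result */
struct model_def  { bool always_iowq; };   /* the new opdef bit */

/*
 * Mirrors the patched check in io_issue_sqe(): when NO_OFFLOAD is set,
 * NONBLOCK is dropped if there is no file, the file cannot be polled,
 * or the opcode is marked always_iowq. The request then runs inline and
 * may block instead of returning -EAGAIN.
 */
static unsigned int model_issue_flags(unsigned int issue_flags,
				      const struct model_file *file,
				      const struct model_def *def)
{
	assert(def != NULL);
	if ((issue_flags & MODEL_F_NO_OFFLOAD) &&
	    (file == NULL || !file->pollable || def->always_iowq))
		issue_flags &= ~MODEL_F_NONBLOCK;
	return issue_flags;
}
```

In this model the opdef bit only matters when NO_OFFLOAD is set; without it, the flags pass through unchanged.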

I guess the check should be !def->always_iowq?

How so? No opcode that takes pollable files is, or should be, setting
->always_iowq. If we can poll the file, we should not force inline
submission. Basically, the ones setting ->always_iowq always return
-EAGAIN if nonblock == true.

I meant that IO_URING_F_NONBLOCK is cleared here for ->always_iowq, so
these OPs won't return -EAGAIN and will instead run directly in the
current task context.

Right, if IO_URING_F_NO_OFFLOAD is set, which is entirely the point of
it :-)

But ->always_iowq isn't actually _always_, since fallocate/fsync/... are
not punted to io-wq when IO_URING_F_NO_OFFLOAD is set. Isn't the naming
of ->always_iowq a bit confusing?
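
Ming's point can be illustrated with a second simplified model, assuming (as Jens describes above) that an ->always_iowq handler returns -EAGAIN whenever NONBLOCK is set. model_blocking_op() and model_issue_always_iowq() are hypothetical names, and the blocking retry after -EAGAIN stands in for the io-wq punt; none of this is kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's IO_URING_F_* values differ. */
#define MODEL_F_NONBLOCK   (1u << 0)
#define MODEL_F_NO_OFFLOAD (1u << 1)

/* Where a request ends up running in this model. */
enum model_ctx { MODEL_INLINE, MODEL_IOWQ };

/* Stand-in for an always_iowq-style handler (fsync, fallocate, ...):
 * it has no non-blocking path, so it refuses to run with NONBLOCK. */
static int model_blocking_op(unsigned int issue_flags)
{
	return (issue_flags & MODEL_F_NONBLOCK) ? -EAGAIN : 0;
}

/*
 * Normally the first attempt is non-blocking, fails with -EAGAIN and is
 * retried from io-wq. With NO_OFFLOAD, NONBLOCK is cleared up front, so
 * the op runs (and may sleep) in the submitting task instead.
 */
static enum model_ctx model_issue_always_iowq(bool no_offload)
{
	unsigned int flags = MODEL_F_NONBLOCK;

	if (no_offload)
		flags = (flags | MODEL_F_NO_OFFLOAD) & ~MODEL_F_NONBLOCK;

	if (model_blocking_op(flags) == -EAGAIN) {
		/* Blocking retry, standing in for an io-wq worker. */
		int ret = model_blocking_op(flags & ~MODEL_F_NONBLOCK);

		assert(ret == 0);
		return MODEL_IOWQ;
	}
	return MODEL_INLINE;
}
```

So in this model the bit means "always io-wq unless NO_OFFLOAD is set", which is the mismatch with the name that Ming raises.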

Yeah, the naming isn't that great; I can see how that's a bit confusing.
I'll be happy to take suggestions on what would make it clearer.

Naming aside, I am also wondering why these ->always_iowq OPs aren't
punted to io-wq when IO_URING_F_NO_OFFLOAD is set. Running them inline
shouldn't improve performance, because these OPs are expected to be slow
and always sleep, unlike others (buffered writes, ...). Can you give a
hint about why these OPs are not offloaded? Or is it just that
NO_OFFLOAD must not offload any OP?

The whole point of NO_OFFLOAD is that items that would normally be
passed to io-wq are just run inline. This provides a way to reap the
benefits of batched submissions and syscall reductions. Some opcodes
will just never be async, and io-wq offloads are not very fast. Some of
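
The batching and syscall-reduction argument can be pictured with a toy submission queue. This is a hypothetical user-space model, not the io_uring ABI: struct model_sq, model_queue() and model_enter() stand in for filling several SQEs and making a single io_uring_enter(2) call for the whole batch, with every request issued in the submitting task as NO_OFFLOAD intends.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SQ_ENTRIES 32

/* Toy submission queue: user space queues requests, and one simulated
 * io_uring_enter() consumes everything queued so far. */
struct model_sq {
	int ops[MODEL_SQ_ENTRIES];  /* hypothetical request payloads */
	size_t head, tail;          /* consumed / produced counters */
	unsigned int enters;        /* simulated syscalls made */
	unsigned int ran_inline;    /* requests issued in the caller's context */
};

/* Queue one request; returns 0 on success or -1 if the ring is full. */
static int model_queue(struct model_sq *sq, int op)
{
	if (sq->tail - sq->head == MODEL_SQ_ENTRIES)
		return -1;
	sq->ops[sq->tail % MODEL_SQ_ENTRIES] = op;
	sq->tail++;
	return 0;
}

/*
 * One simulated io_uring_enter(): every queued request is issued in the
 * submitting task, at the cost of one syscall for the whole batch.
 * Returns the number of requests issued.
 */
static size_t model_enter(struct model_sq *sq)
{
	size_t n = 0;

	while (sq->head != sq->tail) {
		int op = sq->ops[sq->head % MODEL_SQ_ENTRIES];

		assert(op >= 0);    /* stand-in for issuing the request inline */
		sq->head++;
		sq->ran_inline++;
		n++;
	}
	sq->enters++;
	return n;
}
```

Here eight queued requests cost one model_enter() call instead of eight; with liburing, the equivalent is preparing several SQEs and calling io_uring_submit() once.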

Yeah, it seems io-wq is much slower than inline issue; maybe that needs
to be looked into, especially since it is easy to end up in io-wq with
IOSQE_IO_LINK.

Indeed, depending on what is being linked, you may see io-wq activity,
which is not ideal.

That is why I prefer the fused command for ublk zero copy: the
buffer-registration approach suggested by Pavel and Ziyang has to link
the register-buffer OP with the actual IO OP, and IOPS was observed to
drop by half in a 4k random IO test with that approach.
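
The register-buffer approach relies on IOSQE_IO_LINK ordering: a linked request is not started until the previous one completes, and if a request in the chain fails, the remaining linked requests complete with -ECANCELED (see io_uring_enter(2)). The sketch below models that rule for a two-request chain in plain C, treating any negative result as a failure; model_run_chain() and the request callbacks are hypothetical, and no real io_uring calls are made.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A request in this model is just a function returning a CQE result. */
typedef int (*model_req_fn)(void);

/*
 * Run a chain of linked requests in order, writing each result to res[].
 * Once a request fails (result < 0), later requests are not run and
 * complete with -ECANCELED, as with IOSQE_IO_LINK.
 */
static void model_run_chain(const model_req_fn *reqs, int *res, size_t n)
{
	int failed = 0;

	for (size_t i = 0; i < n; i++) {
		assert(reqs[i] != NULL);
		if (failed) {
			res[i] = -ECANCELED;
			continue;
		}
		res[i] = reqs[i]();
		if (res[i] < 0)
			failed = 1;
	}
}

/* Hypothetical stand-ins for "register buffer" and the IO that uses it. */
static int model_reg_buf_ok(void)   { return 0; }
static int model_reg_buf_fail(void) { return -ENOMEM; }
static int model_io_4k(void)        { return 4096; }
```

The serialization is the point: the IO request cannot start until the registration has completed, which is one place where the extra latency of a linked pair can come from.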

What's good about it is that you can use linked requests with it
but you _don't have to_.

Curiously, I just recently compared submitting 8 two-request links
(16 reqs in total) vs submit(8)+submit(8), all in a loop.
The latter was faster. It wasn't a clean experiment, but it shows
that links are not super fast and it would be nice to make them better.

As for the register buf approach, I tried it out and it looked good to me.
It outperforms splice requests (with a hack that removes forced io-wq
execution) by 5-10% in a synthetic benchmark, and works better than
splice(2) for QD>=2. Let me send it out, perhaps today, so we can
figure out how it compares against ublk/fused and what the margin is.

--
Pavel Begunkov


