Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:

> > So I don't think it matters to performance whether there's a combined
> > WRITEV + READV (or COMMIT + FETCH) op or separate ops.
>
> This has to be proven with performance numbers, and it is by no means
> what I'm seeing. How should io-uring improve performance if you have
> the same number of system calls?

The ops can be queued together and submitted together.  Two separate
(but possibly linked) ops should result in exactly the same number of
syscalls as a single combined op.
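
For example, with liburing both ops go out in one io_uring_submit() call,
i.e. one syscall (a rough sketch only; the fd and the buffers are
placeholders, not working server code):

static int queue_commit_and_fetch(struct io_uring *ring, int fuse_fd,
				  struct iovec *reply_iov, struct iovec *fetch_iov)
{
	/* #include <liburing.h>, #include <sys/uio.h> */
	struct io_uring_sqe *sqe;

	/* "COMMIT": write the reply for the previous request.
	 * (io_uring_get_sqe() can return NULL if the SQ is full; skipped here.) */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, fuse_fd, reply_iov, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;	/* optionally order the two ops */

	/* "FETCH": read the next request. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_readv(sqe, fuse_fd, fetch_iov, 1, 0);

	return io_uring_submit(ring);	/* both ops, one syscall */
}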

> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
> change in the fuse kernel code? I.e. IOs would go via fuse_dev_read(),
> and the request would not encode which queue it belongs to?

The original idea was to use the cloned /dev/fuse fd to sort requests
into separate queues.  That was only half finished: the input queue is
currently shared by all clones, but once a request is read by the
server from a particular clone it is put into a separate processing
queue.   Adding separate input queues to each clone should also be
possible.
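
For reference, the cloning itself is just the FUSE_DEV_IOC_CLONE ioctl;
roughly what libfuse's clone_fd mode does (sketch only, error handling
trimmed):

/* #include <fcntl.h>, <sys/ioctl.h>, <unistd.h>, <stdint.h>, <linux/fuse.h> */
static int fuse_clone_dev_fd(int session_fd)
{
	uint32_t src = session_fd;	/* fd of the existing fuse connection */
	int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

	if (clone_fd < 0)
		return -1;
	if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &src) < 0) {
		close(clone_fd);
		return -1;
	}
	return clone_fd;	/* each worker thread can read requests on its own fd */
}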

I'm not saying this is definitely the direction we should be taking,
but it's something to consider.

> > The advantage of separate ops is more flexibility and less complexity
> > (do only one thing in an op)
>
> Did you look at patch 12/19? It just does
> fuse_uring_req_end_and_get_next(). That part isn't complex, imho.

That function name indicates that this is too complex: it is doing two
independent things (ending one request and fetching the next).

That would be fine if it's a valid optimization, but I'm saying that it
likely isn't.

> > The major difference between your idea of a fuse_uring and the
> > io_uring seems to be that you place not only the request on the shared
> > buffer, but the data as well.   I don't think this is a good idea,
> > since it will often incur one more memory copy.  Otherwise the idea
> > itself seems sound.
>
> Could you explain what you mean by "one more memory copy"?

If the filesystem is providing the result of a READ request as a
pointer to a buffer (which can be the case with fuse_reply_data()),
then that buffer will need to be copied to the shared buffer, and from
the shared buffer to the read destination.

That first copy is unnecessary if the kernel receives the pointer to
the userspace buffer and copies the data directly to the destination.
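
To make the copy count concrete, a toy userspace illustration (memcpy
stands in for the kernel's copy_to_user(); the buffer names are just
placeholders):

/* #include <string.h>, #include <stdio.h> */
#define LEN 4096

static char fs_buf[LEN];	/* buffer the filesystem already owns */
static char ring[LEN];		/* shared ring buffer */
static char read_dst[LEN];	/* the reader's destination */

int main(void)
{
	memset(fs_buf, 'x', LEN);

	/* Shared-buffer design: server copies into the ring (copy #1),
	 * kernel copies from the ring to the destination (copy #2). */
	memcpy(ring, fs_buf, LEN);
	memcpy(read_dst, ring, LEN);

	/* Pointer-passing design: kernel copies straight from the
	 * filesystem's buffer to the destination (one copy). */
	memcpy(read_dst, fs_buf, LEN);

	printf("shared buffer: 2 copies, pointer passing: 1 copy\n");
	return 0;
}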

> > So I think either better integration with io_uring is needed with
> > support for "reverse submission" or a new interface.
>
> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
> ublk_drv also works exactly that way. I had pointed this out before:
> initially I had considered writing a reverse io-uring myself, and then
> exactly at that time ublk came up.

I'm just looking for answers as to why this architecture is the best.
Maybe it is, but I find it too complex and can't explain why it's going
to perform better than a much simpler single-ring model.

> The interface of that 'reverse io' to io-uring is really simple.
>
> 1) Userspace sends a IORING_OP_URING_CMD SQE
> 2) That CMD gets handled/queued by struct file_operations::uring_cmd /
> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
> request
> 3) When the fuse client has data to complete the request, it calls
> io_uring_cmd_done() and the fuse server receives a CQE with the fuse request.
>
> Personally I don't see anything twisted here, one just needs to
> understand that IORING_OP_URING_CMD was written for that reverse order.

That's just my gut feeling.  fuse/dev_uring.c is 1233 lines in this RFC,
and that's just the queuing.
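
For reference, steps 1) to 3) above boil down to roughly the following
pattern, modeled on how ublk_drv uses the same hooks (a sketch only:
struct fuse_ring_ent and the lookup helper are made-up illustrative
names, not the code in this patch series):

/* #include <linux/module.h>, <linux/fs.h>, <linux/io_uring/cmd.h> */
struct fuse_ring_ent {
	struct io_uring_cmd *cmd;	/* pending IORING_OP_URING_CMD */
	void *payload;			/* per-entry buffer for the fuse request */
};

/* Hypothetical lookup of the ring entry this cmd belongs to. */
static struct fuse_ring_ent *fuse_ring_ent_from_cmd(struct io_uring_cmd *ioucmd);

/* 1) + 2) The server's IORING_OP_URING_CMD SQE arrives here via
 * file_operations->uring_cmd; the entry is parked, no CQE is posted yet. */
static int fuse_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
{
	struct fuse_ring_ent *ent = fuse_ring_ent_from_cmd(ioucmd);

	ent->cmd = ioucmd;		/* remember the cmd for later completion */
	return -EIOCBQUEUED;		/* keep the SQE pending */
}

/* 3) When the fuse client has a request for this entry, it fills the
 * payload buffer and completes the parked cmd, which posts the CQE the
 * server is waiting for.  (The issue_flags handling is the part being
 * discussed under patch 19/19.) */
static void fuse_uring_complete_to_server(struct fuse_ring_ent *ent,
					  unsigned int issue_flags)
{
	io_uring_cmd_done(ent->cmd, 0, 0, issue_flags);
}

static const struct file_operations fuse_uring_fops_sketch = {
	.owner		= THIS_MODULE,
	.uring_cmd	= fuse_uring_cmd,
};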

> (A slight twist came up when io-uring introduced issue_flags - that is
> part of the discussion of patch 19/19 with Jens in this series. Jens
> suggested working on io-uring improvements once the main series is
> merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
> Jens for help once the other parts are merged. Right now that is easy to
> work around by always submitting from an io-uring task).
>
> Also, that simplicity is the reason why I'm hesitant to work on
> Kent's new ring, as io-uring already has everything we need, with a
> rather simple interface.

I'm in favor of using io_uring, if possible.

I'm also in favor of a single shared buffer (ring) if possible.  Using
cloned fd + plain READV / WRITEV ops is one possibility.

But I'm not opposed to IORING_OP_URING_CMD either.  Btw, the fuse reply
could in the majority of cases be inlined into that 80 byte free space
in the sqe.  We might also consider an extended cqe mode, where a short
fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112
byte payload).
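
Back-of-the-envelope numbers for that inlining idea (a quick userspace
check; assumes IORING_SETUP_SQE128's 80-byte cmd area and the
hypothetical 128-byte CQE mode mentioned above):

/* #include <stdio.h>, #include <linux/fuse.h> */
int main(void)
{
	/* With SQE128, IORING_OP_URING_CMD gets an 80-byte cmd area. */
	const size_t sqe_cmd_area = 80;
	/* A 128-byte CQE would keep the 16-byte header (user_data/res/flags). */
	const size_t cqe128_payload = 128 - 16;

	printf("inline reply space in the sqe cmd area: %zu bytes\n",
	       sqe_cmd_area - sizeof(struct fuse_out_header));
	printf("inline request space in a 128-byte cqe: %zu bytes\n",
	       cqe128_payload - sizeof(struct fuse_in_header));
	return 0;
}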

> To be honest, I wonder how you worked around scheduler issues when waking
> up the application thread. Did you core-bind application threads as well
> (I mean besides fuse server threads)? We now have this (unexported)
> wake_on_current_cpu. Last year that still wasn't working perfectly well,
> and Hillf Danton suggested the 'seesaw' approach; with that the
> scheduler was working very well. You could get the same with application
> core binding, but with 512 CPUs that is certainly not done manually
> anymore. Did you use a script to bind application threads, or did you
> core-bind from within the application?

Probably, I don't remember anymore.

Thanks,
Miklos



