Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

On 6/12/24 09:39, Miklos Szeredi wrote:
> On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
> 
>>> So I don't think it matters to performance whether there's a combined
>>> WRITEV + READV (or COMMIT + FETCH) op or separate ops.
>>
>> This has to be performance proven and is no means what I'm seeing. How
>> should io-uring improve performance if you have the same number of
>> system calls?
> 
> The ops can be queued together and submitted together.  Two separate
> (but possibly linked) ops should result in exactly the same number of
> syscalls as a single combined op.

As I wrote before, that requires double the amount of queue buffer memory.
It goes in exactly the opposite direction of what I'm currently working
on: using less memory, as not all requests need 1MB buffers and we want
to use as little memory as possible.

> 
>> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
>> change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
>> I.e. we would not have encoded in the request which queue it belongs to?
> 
> The original idea was to use the cloned /dev/fuse fd to sort requests
> into separate queues.  That was only half finished: the input queue is
> currently shared by all clones, but once a request is read by the
> server from a particular clone it is put into a separate processing
> queue.   Adding separate input queues to each clone should also be
> possible.
> 
> I'm not saying this is definitely the direction we should be taking,
> but it's something to consider.

I was considering using that for the mmap method, but then found it easier
to just add a per-NUMA entry to an rb-tree. Well, maybe I should reconsider,
as the current patch series clones the device anyway.

> 
>>> The advantage of separate ops is more flexibility and less complexity
>>> (do only one thing in an op)
>>
>> Did you look at patch 12/19? It just does
>> fuse_uring_req_end_and_get_next(). That part isn't complex, imho.
> 
> That function name indicates that this is too complex: it is doing two
> independent things (ending one request and fetching the next).
> 
> Fine if it's a valid optimization, but I'm saying that it likely isn't.

Would it help if it were split into two calls? Like

fuse_uring_req_end();
fuse_uring_req_next();

It has to check whether there are requests queued, as the entry goes on hold
if there are none and then won't process the queue.
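Roughly, the second helper would then hold the "check for queued requests,
otherwise go on hold" part. A sketch of that - the fuse_ring_ent /
fuse_ring_queue fields are my assumptions about the RFC's data structures,
and fuse_uring_send_to_ring() is a hypothetical helper that fills the ring
buffer and completes the CQE:

static void fuse_uring_req_next(struct fuse_ring_ent *ent)
{
        struct fuse_ring_queue *queue = ent->queue;
        struct fuse_req *req = NULL;

        spin_lock(&queue->lock);
        if (list_empty(&queue->fuse_req_queue)) {
                /* nothing queued: the entry goes on hold until a new
                 * fuse request arrives and wakes it up */
                list_add_tail(&ent->list, &queue->ent_avail_queue);
        } else {
                req = list_first_entry(&queue->fuse_req_queue,
                                       struct fuse_req, list);
                list_del_init(&req->list);
        }
        spin_unlock(&queue->lock);

        if (req)
                fuse_uring_send_to_ring(ent, req);
}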

> 
>>> The major difference between your idea of a fuse_uring and the
>>> io_uring seems to be that you place not only the request on the shared
>>> buffer, but the data as well.   I don't think this is a good idea,
>>> since it will often incur one more memory copy.  Otherwise the idea
>>> itself seems sound.
>>
>> Coud you explain what you mean with "one more memory copy"?
> 
> If the filesystem is providing the result of a READ request as a
> pointer to a buffer (which can be the case with fuse_reply_data()),
> then that buffer will need to be copied to the shared buffer, and from
> the shared buffer to the read destination.
> 
> That first copy is unnecessary if the kernel receives the pointer to
> the userspace buffer and copies the data directly to the destination.

I didn't do that yet, as we are going to use the ring buffer for requests
directly, i.e. the ring buffer immediately receives all the data from the
network, so there is no copy. Even if the ring buffer gets its data from a
local disk, there is no need for a separate application buffer anymore.
With that, there just is no extra copy.

Keep in mind that the ring buffers are coupled with the request, not with
the processing thread as in current libfuse - the buffer is valid as
long as the request is not completed. That part is probably harder with
IORING_OP_READV/IORING_OP_WRITEV, especially if you need two of them.


Your idea sounds useful if userspace has its own cache outside of the
ring buffers; that could be added as another optimization, something
like:

- the fuse-ring-req gets a user pointer
- a flag marks whether the user pointer is set
- the kernel acts on that flag
- the completion gets sent to userspace by:
    - if the next request is already in the queue: piggy-back the
      completion onto that next request
    - if there is no next request: send a separate completion message

If you want, I could add that as another optimization patch to the next RFC.
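A rough sketch of how that could look on the shared request structure - all
names, the layout and the flag value are hypothetical, just to make the idea
concrete:

#include <stdint.h>

/* Hypothetical, only to illustrate the optimization described above -
 * not the RFC's actual structures. */
#define FUSE_RING_REQ_FLAG_USER_BUF     (1ULL << 0) /* payload is in a user buffer */

struct fuse_ring_req_out {
        uint64_t flags;
        uint64_t user_buf;      /* userspace pointer, only valid if the flag is set */
        uint32_t user_buf_len;
        uint32_t padding;
        /* fuse header and inline payload section follow as before */
};

The kernel would then copy directly from that user buffer to the read
destination, and the completion would either be piggy-backed onto the next
queued request or sent as a separate message, as described above.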

Maybe that is also useful for existing libfuse applications that are used
to the fact that the request buffer is associated with the thread and not
the request itself.

If we want real zero copy, we can add in Ming's work on that.
I'm going to look into it today or tomorrow.


> 
>>> So I think either better integration with io_uring is needed with
>>> support for "reverse submission" or a new interface.
>>
>> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
>> ublk_drv  also works exactly that way. I had pointed it out before,
>> initially I had considered to write a reverse io-uring myself and then
>> exactly at that time ublk came up.
> 
> I'm just looking for answers why this architecture is the best.  Maybe
> it is, but I find it too complex and can't explain why it's going to
> perform better than a much simpler single ring model.
> 
>> The interface of that 'reverse io' to io-uring is really simple.
>>
>> 1) Userspace sends a IORING_OP_URING_CMD SQE
>> 2) That CMD gets handled/queued by struct file_operations::uring_cmd /
>> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
>> request
>> 3) When fuse client has data to complete the request, it calls
>> io_uring_cmd_done() and fuse server receives a CQE with the fuse request.
>>
>> Personally I don't see anything twisted here, one just needs to
>> understand that IORING_OP_URING_CMD was written for that reverse order.
> 
> That's just my gut feeling.   fuse/dev_uring.c is 1233 in this RFC.
> And that's just the queuing.


Dunno, from my point of view the main logic is so much simpler than what
fuse_dev_do_read() has - that function checks multiple queues, takes multiple
locks, has to add the fuse-req to a (now hashed) list and has a restart loop.
If possible, I really wouldn't like to make that even more complex.

But we want to add to it:

- Selecting different ring entries based on their IO size (what I'm currently
adding in, to reduce memory per queue). I don't think that would even be
possible with the current wake logic.

- Credits for different IO types, to avoid one IO type filling the entire
queue. Right now the ring only has a separation of sync/async, but I don't
think that is enough.


To compare, could you please check the code flow of
FUSE_URING_REQ_COMMIT_AND_FETCH? I have no issue with splitting
fuse_uring_req_end_and_get_next() into two functions.

What I mean is that the code flow is hopefully not hard to follow:
it ends the request and then puts the entry on the avail list.
Then it checks whether there is a pending fuse request and handles that.
From my point of view that is much easier to follow, with fewer conditions
than fuse_dev_do_read() has - and that although there is already
a separation of sync and async queues.
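In other words, roughly this flow (simplified; the helper names follow the
split suggested above, the rest is my shorthand for the RFC code, not a
verbatim copy):

/* Simplified flow of FUSE_URING_REQ_COMMIT_AND_FETCH; helper names and
 * signatures are illustrative only. */
static int fuse_uring_commit_and_fetch(struct io_uring_cmd *cmd,
                                       struct fuse_ring_ent *ent)
{
        int err;

        /* 1) commit: copy the reply out of the ring buffer, end the request */
        err = fuse_uring_copy_from_ring(ent, ent->fuse_req);
        fuse_uring_req_end(ent, err);

        /* 2) the ring entry is available again
         * 3) fetch: if a fuse request is queued, fill the ring buffer and
         *    complete the CQE via io_uring_cmd_done(); otherwise the entry
         *    stays on hold until a new request arrives
         */
        fuse_uring_req_next(ent);

        return -EIOCBQUEUED;
}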

> 
>> (There came up a light twisting when io-uring introduced issue_flags -
>> that is part of discussion of patch 19/19 with Jens in the series. Jens
>> suggested to work on io-uring improvements once the main series is
>> merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
>> Jens for help once the other parts are merged. Right now that easy to
>> work around by always submitting with an io-uring task).
>>
>> Also, that simplicity is the reason why I'm hesitating a bit to work on
>> Kents new ring, as io-uring already has all what we need and with a
>> rather simple interface.
> 
> I'm in favor of using io_uring, if possible.

I have no objections, but I would also like to see an RHEL version with it...

> 
> I'm also in favor of a single shared buffer (ring) if possible.  Using
> cloned fd + plain READV / WRITEV ops is one possibility.


Cons:
- READV / WRITEV would need to be coupled (linked) in order to avoid two
io-uring enter system calls - doubles the amount of buffer memory
- Not impossible, but harder to achieve: the request buffer belongs
to the request itself
- The request data section and the fuse header are not clearly separated -
data alignment and compat issues
- Different IO sizes are hard to impossible - with large queue sizes that
means high memory usage, even if the majority of requests are small
- Request-type credits are much harder to achieve
- The vfs application cannot directly write into the ring buffer, which
reduces future optimizations
- The new zero-copy approach Ming Lei is working on cannot be used
- Not very flexible for future additions, IMHO

Pros:
- Probably fewer code additions
- No shutdown issues
- Existing splice works

> 
> But I'm not opposed to IORING_OP_URING_CMD either.   Btw, fuse reply
> could be inlined in the majority of cases into that 80 byte free space
> in the sqe.  Also might consider an extended cqe mode, where short
> fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112
> byte payload).


That conflicts with the goal of having the fuse header in a separate section
to avoid alignment and compat issues. In fact, we might even want to make the
size of that header section depend on the system page size.
And then what would be the advantage? We have the buffer anyway.
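To make that concrete, one possible layout of a ring-entry buffer with a
separate header section - the names and the section size below are
placeholders of mine, not the RFC's definitions:

#include <stdint.h>

#define FUSE_RING_HEADER_SECTION_SIZE   4096    /* could be tied to the system page size */

struct fuse_ring_entry_buf {
        /* fuse_in_header / fuse_out_header plus op-specific arguments */
        uint8_t header[FUSE_RING_HEADER_SECTION_SIZE];
        /* data section, starting at a fixed, page-aligned offset - no
         * alignment or compat games with inlining into SQE/CQE space */
        uint8_t payload[];
};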


Thanks,
Bernd


PS: What I definitely realize is that at LSFMM2023 I should have talked about
why I had taken that approach, and should have reduced the slides about the
architecture and performance.



