Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

Bernd Schubert <bernd.schubert@xxxxxxxxxxx> · Tue, 11 Jun 2024 19:37:30 +0200

On 6/11/24 17:35, Miklos Szeredi wrote:
> On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
> 
>> Secondly, with IORING_OP_URING_CMD we already have only a single command
>> to submit requests and fetch the next one - half of the system calls.
>>
>> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
>> https://github.com/uroni/fuseuring?
>> I.e. it hook into the existing fuse and just changes from read()/write()
>> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
>> control which ring/queue and which ring-entry handles the request.
> 
> Unlike system calls, io_uring ops should have very little overhead.
> That's one of the main selling points of io_uring (as described in the
> io_uring(7) man page).
> 
> So I don't think it matters to performance whether there's a combined
> WRITEV + READV (or COMMIT + FETCH) op or separate ops.

This has to be performance proven and is no means what I'm seeing. How
should io-uring improve performance if you have the same number of
system calls?

As I see it (@Jens or @Pavel or anyone else please correct me if I'm
wrong), advantage of io-uring comes when there is no syscall overhead at
all - either you have a ring with multiple entries and then one side
operates on multiple entries or you have polling and no syscall overhead
either. We cannot afford cpu intensive polling - out of question,
besides that I had even tried SQPOLL and it made things worse (that is
actually where my idea about application polling comes from).
As I see it, for sync blocking calls (like meta operations) with one
entry in the queue, you would get no advantage with
IORING_OP_READV/IORING_OP_WRITEV -  io-uring has  do two system calls -
one to submit from kernel to userspace and another from userspace to
kernel. Why should io-uring be faster there?

And from my testing this is exactly what I had seen - io-uring for meta
requests (i.e. without a large request queue and *without* core
affinity) makes meta operations even slower that /dev/fuse.

For anything that imposes a large ring queue and where either side
(kernel or userspace) needs to process multiple ring entries - system
call overhead gets reduced by the queue size. Just for DIO or meta
operations that is hard to reach.

Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
change in fuse kernel? I.e. IOs would go via fuse_dev_read()?
I.e. we would not have encoded in the request which queue it belongs to?

> 
> The advantage of separate ops is more flexibility and less complexity
> (do only one thing in an op)

Did you look at patch 12/19? It just does
fuse_uring_req_end_and_get_next(). That part isn't complex, imho.

> 
>> Thirdly, initially I had even removed the allocation of 'struct
>> fuse_req' and directly allocated these on available ring entries. I.e.
>> the application thread was writing to the mmap ring buffer. I just
>> removed that code for now, as it introduced additional complexity with
>> an unknown performance outcome. If it should be helpful we could add
>> that later. I don't think we have that flexibility with
>> IORING_OP_READV/IORING_OP_WRITEV.
> 
> I think I understand what you'd like to see in the end: basically a
> reverse io_uring, where requests are placed on a "submission queue" by
> the kernel and completed requests are placed on a completion queue by
> the userspace.  Exactly the opposite of current io_uring.
> 
> The major difference between your idea of a fuse_uring and the
> io_uring seems to be that you place not only the request on the shared
> buffer, but the data as well.   I don't think this is a good idea,
> since it will often incur one more memory copy.  Otherwise the idea
> itself seems sound.

Coud you explain what you mean with "one more memory copy"? As it is
right now, 'struct fuse_req' is always allocated as it was before and
then a copy is done to the ring entry. No difference to legacy /dev/fuse
IO, which also copies to the read buffer.

If we would avoid allocating 'struct fuse_req' when there are free ring
entry requests we would reduce copies, but never increase?

Btw, advantage for the ring is on the libfuse side, where the
fuse-request buffer is assigned to the CQE and as long as the request is
not completed, the buffer is valid. (For /dev/fuse IO that could be
solved in libfuse by decoupling request memory from the thread, but with
the current ring buffer design that just happens more naturally and
memory is limited by the queue size.)

> 
> The implementation twisted due to having to integrate it with
> io_uring.  Unfortunately placing fuse requests directly into the
> io_uring queue doesn't work, due to the reversal of roles and the size
> difference between sqe and cqe entries.  Also the shared buffer seems
> to lose its ring aspect due to the fact that fuse doesn't get notified
> when a request is taken off the queue (io_uring_cqe_seen(3)).
> 
> So I think either better integration with io_uring is needed with
> support for "reverse submission" or a new interface.

Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
ublk_drv  also works exactly that way. I had pointed it out before,
initially I had considered to write a reverse io-uring myself and then
exactly at that time ublk came up.

The interface of that 'reverse io' to io-uring is really simple.

1) Userspace sends a IORING_OP_URING_CMD SQE
2) That CMD gets handled/queued by struct file_operations::uring_cmd /
fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
request
3) When fuse client has data to complete the request, it calls
io_uring_cmd_done() and fuse server receives a CQE with the fuse request.

Personally I don't see anything twisted here, one just needs to
understand that IORING_OP_URING_CMD was written for that reverse order.

(There came up a light twisting when io-uring introduced issue_flags -
that is part of discussion of patch 19/19 with Jens in the series. Jens
suggested to work on io-uring improvements once the main series is
merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
Jens for help once the other parts are merged. Right now that easy to
work around by always submitting with an io-uring task).

Also, that simplicity is the reason why I'm hesitating a bit to work on
Kents new ring, as io-uring already has all what we need and with a
rather simple interface.

Well, maybe you mean patch 09/19 "Add a dev_release exception for
fuse-over-io-uring". Yep, that is the shutdown part I'm not too happy
about and which initially lead to the async release thread in RFCv1.

> 
>>>
>>> Maybe there's an advantage in using an atomic op for WRITEV + READV,
>>> but I'm not quite seeing it yet, since there's no syscall overhead for
>>> separate ops.
>>
>> Here I get confused, could please explain?
>> Current fuse has a read + write operation - a read() system call to
>> process a fuse request and a write() call to submit the result and then
>> read() to fetch the next request.
>> If userspace has to send IORING_OP_READV to fetch a request and complete
>> with IORING_OP_IORING_OP_WRITEV it would go through existing code path
>> with operations? Well, maybe userspace could submit with IOSQE_IO_LINK,
>> but that sounds like it would need to send two ring entries? I.e. memory
>> and processing overhead?
> 
> Overhead should be minimal.

See above, for single entry blocking requests you get two system calls +
io-uring overhead.

> 
>> And then, no way to further optimize and do fuse_req allocation on the
>> ring (if there are free entries). And probably also no way that we ever
>> let the application work in the SQPOLL way, because the application
>> thread does not have the right to read from the fuse-server buffer? I.e.
>> what I mean is that IORING_OP_URING_CMD gives a better flexibility.
> 
> There should be no difference between IORING_OP_URING_CMD and
> IORING_OP_WRITEV +  IORING_OP_READV in this respect.  At least I don't
> see why polling would work differently: the writev should complete
> immediately and then the readv is queued.  Same as what effectively
> happens with IORING_OP_URING_CMD, no?

Polling yes, but without shared memory the application thread does not
have the right to read from fuse userspace server request?

> 
>> Btw, another issue that is avoided with the new ring-request layout is
>> compatibility and alignment. The fuse header is always in a 4K section
>> of the request data follow then. I.e. extending request sizes does not
>> impose compatibility issues any more and also for passthrough and
>> similar - O_DIRECT can be passed through to the backend file system.
>> Although these issues probably need to be solved into the current fuse
>> protocol.
> 
> Yes.
> 
>> Last but not least, with separation there is no global async queue
>> anymore - no global lock and cache issues.
> 
> The global async code should be moved into the /dev/fuse specific
> "legacy" queuing so it doesn't affect either uring or virtiofs
> queuing.

Yep, wait a few days I have seen your recent patch and I'm may to add
that to my series. I actually considered to point of that the bg queue
could be handled by that series as well, but then preferred to just add
patch for that in my series, which will make use of it for the ring queue.

> 
>>>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
>>>>
>>>> This cache line bouncing should be addressed by these patches
>>>> as well.
>>>
>>> Why do you think this is solved?
>>
>>
>> I _guess_ that writing to the mmaped buffer and processing requests on
>> the same cpu core should make it possible to keep things in cpu cache. I
>> did not verify that with perf, though.
> 
> Well, the issue is with any context switch that happens in the
> multithreaded fuse server process.  Shared address spaces will have a
> common "which CPU this is currently running on" bitmap
> (mm->cpu_bitmap), which is updated whenever one of the threads using
> this address space gets scheduled or descheduled.
> 
> Now imagine a fuse server running on a big numa system, which has
> threads bound to each CPU.  The server tries to avoid using sharing
> data structures between threads, so that cache remains local.  But it
> can't avoid updating this bitmap on schedule.  The bitmap can pack 512
> CPUs into a single cacheline, which means that thread locality is
> compromised.

To be honest, I wonder how you worked around scheduler issues on waking
up the application thread. Did you core bind application threads as well
(I mean besides fuse server threads)? We now have this (unexported)
wake_on_current_cpu. Last year that still wasn't working perfectly well
and  Hillf Danton has suggested the 'seesaw' approach. And with that the
scheduler was working very well. You could get the same with application
core binding, but with 512 CPUs that is certainly not done manually
anymore. Did you use a script to bind application threads or did you
core bind from within the application?

> 
> I'm somewhat surprised that this doesn't turn up in profiles in real
> life, but I guess it's not a big deal in most cases.  I only observed
> it with a special "no-op" fuse server running on big numa and with
> per-thread queuing, etc. enabled (fuse2).

Ok, I'm testing only with 32 cores and two numa nodes. For final
benchmarking I could try to get a more recent AMD based system with 96
cores. I don't think we have anything near 512 CPUs in the lab. I'm not
aware of such customer systems either.

> 
>> For sync requests getting the scheduler involved is what is responsible
>> for making really fuse slow. It schedules on random cores, that are in
>> sleep states and additionally frequency scaling does not go up. We
>> really need to stay on the same core here, as that is submitting the
>> result, the core is already running (i.e. not sleeping) and has data in
>> its cache. All benchmark results with sync requests point that out.
> 
> No arguments about that.
> 
>> For async requests, the above still applies, but basically one thread is
>> writing/reading and the other thread handles/provides the data. Random
>> switching of cores is then still not good. At best and to be tested,
>> submitting rather large chunks to other cores.
>> What is indeed to be discussed (and think annotated in the corresponding
>> patch description), if there is a way to figure out if the other core is
>> already busy. But then the scheduler does not know what it is busy with
>> - are these existing fuse requests or something else. That part is
>> really hard and I don't think it makes sense to discuss this right now
>> before the main part is merged. IMHO, better to add a config flag for
>> the current cpu+1 scheduling with an annotation that this setting might
>> go away in the future.
> 
> The cpu + 1 seems pretty arbitrary, and could be a very bad decision
> if there are independent tasks bound to certain CPUs or if the target
> turns out to be on a very distant CPU.
> 
> I'm not sure what the right answer is.   It's probably something like:
> try to schedule this on a CPU which is not busy but is not very
> distant from this one.  The scheduler can probably answer this
> question, but there's no current API for this.

This is just another optimization. What you write is true and I was
aware of that. It was just a rather simple optimization that improved
results - enough to demo it. We can work with scheduler people in the
future on that or we add a bit of our own logic and create mapping of
cpu -> next-cpu-on-same-numa-node. Certainly an API and help from the
scheduler would be preferred.

Thanks,
Bernd