Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

Miklos Szeredi <miklos@xxxxxxxxxx> · Tue, 11 Jun 2024 17:35:52 +0200

On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:

> Secondly, with IORING_OP_URING_CMD we already have only a single command
> to submit requests and fetch the next one - half of the system calls.
>
> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
> https://github.com/uroni/fuseuring?
> I.e. it hook into the existing fuse and just changes from read()/write()
> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
> control which ring/queue and which ring-entry handles the request.

Unlike system calls, io_uring ops should have very little overhead.
That's one of the main selling points of io_uring (as described in the
io_uring(7) man page).

So I don't think it matters to performance whether there's a combined
WRITEV + READV (or COMMIT + FETCH) op or separate ops.

The advantage of separate ops is more flexibility and less complexity
(do only one thing in an op).

> Thirdly, initially I had even removed the allocation of 'struct
> fuse_req' and directly allocated these on available ring entries. I.e.
> the application thread was writing to the mmap ring buffer. I just
> removed that code for now, as it introduced additional complexity with
> an unknown performance outcome. If it should be helpful we could add
> that later. I don't think we have that flexibility with
> IORING_OP_READV/IORING_OP_WRITEV.

I think I understand what you'd like to see in the end: basically a
reverse io_uring, where requests are placed on a "submission queue" by
the kernel and completed requests are placed on a completion queue by
the userspace.  Exactly the opposite of current io_uring.

The major difference between your idea of a fuse_uring and the
io_uring seems to be that you place not only the request on the shared
buffer, but the data as well.   I don't think this is a good idea,
since it will often incur one more memory copy.  Otherwise the idea
itself seems sound.

The implementation twisted due to having to integrate it with
io_uring.  Unfortunately placing fuse requests directly into the
io_uring queue doesn't work, due to the reversal of roles and the size
difference between sqe and cqe entries.  Also the shared buffer seems
to lose its ring aspect due to the fact that fuse doesn't get notified
when a request is taken off the queue (io_uring_cqe_seen(3)).

So I think either better integration with io_uring is needed with
support for "reverse submission" or a new interface.

> >
> > Maybe there's an advantage in using an atomic op for WRITEV + READV,
> > but I'm not quite seeing it yet, since there's no syscall overhead for
> > separate ops.
>
> Here I get confused, could please explain?
> Current fuse has a read + write operation - a read() system call to
> process a fuse request and a write() call to submit the result and then
> read() to fetch the next request.
> If userspace has to send IORING_OP_READV to fetch a request and complete
> with IORING_OP_IORING_OP_WRITEV it would go through existing code path
> with operations? Well, maybe userspace could submit with IOSQE_IO_LINK,
> but that sounds like it would need to send two ring entries? I.e. memory
> and processing overhead?

Overhead should be minimal.

> And then, no way to further optimize and do fuse_req allocation on the
> ring (if there are free entries). And probably also no way that we ever
> let the application work in the SQPOLL way, because the application
> thread does not have the right to read from the fuse-server buffer? I.e.
> what I mean is that IORING_OP_URING_CMD gives a better flexibility.

There should be no difference between IORING_OP_URING_CMD and
IORING_OP_WRITEV +  IORING_OP_READV in this respect.  At least I don't
see why polling would work differently: the writev should complete
immediately and then the readv is queued.  Same as what effectively
happens with IORING_OP_URING_CMD, no?

> Btw, another issue that is avoided with the new ring-request layout is
> compatibility and alignment. The fuse header is always in a 4K section
> of the request data follow then. I.e. extending request sizes does not
> impose compatibility issues any more and also for passthrough and
> similar - O_DIRECT can be passed through to the backend file system.
> Although these issues probably need to be solved into the current fuse
> protocol.

Yes.

> Last but not least, with separation there is no global async queue
> anymore - no global lock and cache issues.

The global async code should be moved into the /dev/fuse specific
"legacy" queuing so it doesn't affect either uring or virtiofs
queuing.

> >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
> >>
> >> This cache line bouncing should be addressed by these patches
> >> as well.
> >
> > Why do you think this is solved?
>
>
> I _guess_ that writing to the mmaped buffer and processing requests on
> the same cpu core should make it possible to keep things in cpu cache. I
> did not verify that with perf, though.

Well, the issue is with any context switch that happens in the
multithreaded fuse server process.  Shared address spaces will have a
common "which CPU this is currently running on" bitmap
(mm->cpu_bitmap), which is updated whenever one of the threads using
this address space gets scheduled or descheduled.

Now imagine a fuse server running on a big numa system, which has
threads bound to each CPU.  The server tries to avoid using sharing
data structures between threads, so that cache remains local.  But it
can't avoid updating this bitmap on schedule.  The bitmap can pack 512
CPUs into a single cacheline, which means that thread locality is
compromised.

I'm somewhat surprised that this doesn't turn up in profiles in real
life, but I guess it's not a big deal in most cases.  I only observed
it with a special "no-op" fuse server running on big numa and with
per-thread queuing, etc. enabled (fuse2).

> For sync requests getting the scheduler involved is what is responsible
> for making really fuse slow. It schedules on random cores, that are in
> sleep states and additionally frequency scaling does not go up. We
> really need to stay on the same core here, as that is submitting the
> result, the core is already running (i.e. not sleeping) and has data in
> its cache. All benchmark results with sync requests point that out.

No arguments about that.

> For async requests, the above still applies, but basically one thread is
> writing/reading and the other thread handles/provides the data. Random
> switching of cores is then still not good. At best and to be tested,
> submitting rather large chunks to other cores.
> What is indeed to be discussed (and think annotated in the corresponding
> patch description), if there is a way to figure out if the other core is
> already busy. But then the scheduler does not know what it is busy with
> - are these existing fuse requests or something else. That part is
> really hard and I don't think it makes sense to discuss this right now
> before the main part is merged. IMHO, better to add a config flag for
> the current cpu+1 scheduling with an annotation that this setting might
> go away in the future.

The cpu + 1 seems pretty arbitrary, and could be a very bad decision
if there are independent tasks bound to certain CPUs or if the target
turns out to be on a very distant CPU.

I'm not sure what the right answer is.   It's probably something like:
try to schedule this on a CPU which is not busy but is not very
distant from this one.  The scheduler can probably answer this
question, but there's no current API for this.

Thanks,
Miklos