Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

Bernd Schubert <bernd.schubert@xxxxxxxxxxx> · Tue, 11 Jun 2024 12:26:29 +0200

On 6/11/24 10:20, Miklos Szeredi wrote:
> On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@xxxxxxx> wrote:
>>
>> From: Bernd Schubert <bschubert@xxxxxxx>
>>
>> This adds support for uring communication between kernel and
>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic
>> appraoch was taken from ublk.  The patches are in RFC state,
>> some major changes are still to be expected.
> 
> Thank you very much for tackling this.  I think this is an important
> feature and one that could potentially have a significant effect on
> fuse performance, which is something many people would love to see.
> 
> I'm thinking about the architecture and there are some questions:
> 
> Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV?  That's
> would just be the async part, without the mapped buffer.  I suspect
> most of the performance advantage comes from this and the per-CPU
> queue, not from the mapped buffer, yet most of the complexity seems to
> be related to the mapped buffer.

I didn't try because IORING_OP_URING_CMD seems to be exactly made for
our case.

Firstly and although I didn't have time to look into it yet, but with
the current approach it should be rather simple to switch to another
ring as Kent has suggested.

Secondly, with IORING_OP_URING_CMD we already have only a single command
to submit requests and fetch the next one - half of the system calls.

Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach?
https://github.com/uroni/fuseuring?
I.e. it hook into the existing fuse and just changes from read()/write()
of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero
control which ring/queue and which ring-entry handles the request.

Thirdly, initially I had even removed the allocation of 'struct
fuse_req' and directly allocated these on available ring entries. I.e.
the application thread was writing to the mmap ring buffer. I just
removed that code for now, as it introduced additional complexity with
an unknown performance outcome. If it should be helpful we could add
that later. I don't think we have that flexibility with
IORING_OP_READV/IORING_OP_WRITEV.

And then, personally I do not see mmap complexity - it is just very
convenient to write/read to/from the ring buffer from any kernel thread.
Currently not supported by io-uring, but I think we could even avoid any
kind of system call and let the application poll for results. Similar to
what IORING_SETUP_SQPOLL does, but without the requirement of another
kernel thread.

Most complexity and issues I got, come from the requirement of io_uring
to complete requests with io_uring_op_cmd_done. In RFCv1 you had
annotated the async shutdown method - that was indeed really painful and
resulted in a never ending number of shutdown races. Once I removed that
and also removed using a bitmap (I don't even know why I used a bitmap
in the first place in RFCv1 and not lists as in RFCv2) shutdown became
manageable.

If there would be way to tell io-uring that kernel fuse is done and to
let it complete itself whatever was not completed yet, it would would be
great help. Although from my point of view, that could be done once this
series is merged.
Or we decide to switch to Kents new ring buffer and might not have that
problem at all...

> 
> Maybe there's an advantage in using an atomic op for WRITEV + READV,
> but I'm not quite seeing it yet, since there's no syscall overhead for
> separate ops.

Here I get confused, could please explain?
Current fuse has a read + write operation - a read() system call to
process a fuse request and a write() call to submit the result and then
read() to fetch the next request.
If userspace has to send IORING_OP_READV to fetch a request and complete
with IORING_OP_IORING_OP_WRITEV it would go through existing code path
with operations? Well, maybe userspace could submit with IOSQE_IO_LINK,
but that sounds like it would need to send two ring entries? I.e. memory
and processing overhead?

And then, no way to further optimize and do fuse_req allocation on the
ring (if there are free entries). And probably also no way that we ever
let the application work in the SQPOLL way, because the application
thread does not have the right to read from the fuse-server buffer? I.e.
what I mean is that IORING_OP_URING_CMD gives a better flexibility.

Btw, another issue that is avoided with the new ring-request layout is
compatibility and alignment. The fuse header is always in a 4K section
of the request data follow then. I.e. extending request sizes does not
impose compatibility issues any more and also for passthrough and
similar - O_DIRECT can be passed through to the backend file system.
Although these issues probably need to be solved into the current fuse
protocol.

> 
> What's the reason for separate async and sync request queues?

To have credits for IO operations. For an overlay file system it might
not matter, because it might get stuck with another system call in the
underlying file system. But for something that is not system call bound
and that has control, async and sync can be handled with priority given
by userspace.

As I had written in the introduction mail, I'm currently working on
different IO sizes per ring queue - it gets even more fine grained and
with the advantage of reduced memory usage per queue when the queue has
entries for many small requests and a few large ones.

Next step would here to add credits for reads/writes (or to reserve
credits for meta operations) in the sync queue, so that meta operations
can always go through. If there should be async meta operations (through
application io-uring requests?) would need to be done for the async
queue as well.

Last but not least, with separation there is no global async queue
anymore - no global lock and cache issues.

> 
>> Avoiding cache line bouncing / numa systems was discussed
>> between Amir and Miklos before and Miklos had posted
>> part of the private discussion here
>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
>>
>> This cache line bouncing should be addressed by these patches
>> as well.
> 
> Why do you think this is solved?

I _guess_ that writing to the mmaped buffer and processing requests on
the same cpu core should make it possible to keep things in cpu cache. I
did not verify that with perf, though.

> 
>> I had also noticed waitq wake-up latencies in fuse before
>> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@xxxxxxxxxxx/T/
>>
>> This spinning approach helped with performance (>40% improvement
>> for file creates), but due to random server side thread/core utilization
>> spinning cannot be well controlled in /dev/fuse mode.
>> With fuse-over-io-uring requests are handled on the same core
>> (sync requests) or on core+1 (large async requests) and performance
>> improvements are achieved without spinning.
> 
> I feel this should be a scheduler decision, but the selecting the
> queue needs to be based on that decision.  Maybe the scheduler people
> can help out with this.

For sync requests getting the scheduler involved is what is responsible
for making really fuse slow. It schedules on random cores, that are in
sleep states and additionally frequency scaling does not go up. We
really need to stay on the same core here, as that is submitting the
result, the core is already running (i.e. not sleeping) and has data in
its cache. All benchmark results with sync requests point that out.

For async requests, the above still applies, but basically one thread is
writing/reading and the other thread handles/provides the data. Random
switching of cores is then still not good. At best and to be tested,
submitting rather large chunks to other cores.
What is indeed to be discussed (and think annotated in the corresponding
patch description), if there is a way to figure out if the other core is
already busy. But then the scheduler does not know what it is busy with
- are these existing fuse requests or something else. That part is
really hard and I don't think it makes sense to discuss this right now
before the main part is merged. IMHO, better to add a config flag for
the current cpu+1 scheduling with an annotation that this setting might
go away in the future.

Thanks,
Bernd