On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@xxxxxxx> wrote:
>
> From: Bernd Schubert <bschubert@xxxxxxx>
>
> This adds support for uring communication between kernel and
> userspace daemon using the IORING_OP_URING_CMD opcode. The basic
> approach was taken from ublk. The patches are in RFC state,
> some major changes are still to be expected.

Thank you very much for tackling this. I think this is an important
feature and one that could potentially have a significant effect on
fuse performance, which is something many people would love to see.

I'm thinking about the architecture and have some questions:

Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV? That
would give just the async part, without the mapped buffer. I suspect
most of the performance advantage comes from this and the per-CPU
queue, not from the mapped buffer, yet most of the complexity seems
to be related to the mapped buffer.

Maybe there's an advantage in using an atomic op for WRITEV + READV,
but I'm not quite seeing it yet, since there's no syscall overhead
for separate ops.

What's the reason for separate async and sync request queues?

> Avoiding cache line bouncing / numa systems was discussed
> between Amir and Miklos before and Miklos had posted
> part of the private discussion here
> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@xxxxxxxxxxxxxx/
>
> This cache line bouncing should be addressed by these patches
> as well.

Why do you think this is solved?

> I had also noticed waitq wake-up latencies in fuse before
> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@xxxxxxxxxxx/T/
>
> This spinning approach helped with performance (>40% improvement
> for file creates), but due to random server-side thread/core
> utilization spinning cannot be well controlled in /dev/fuse mode.
> With fuse-over-io-uring requests are handled on the same core
> (sync requests) or on core+1 (large async requests) and performance
> improvements are achieved without spinning.

I feel this should be a scheduler decision, but selecting the queue
needs to be based on that decision. Maybe the scheduler people can
help out with this.

Thanks,
Miklos