On Wed, 2024-05-29 at 14:32 -0400, Stefan Hajnoczi wrote:
> On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> > Virtiofs has its own queuing mechanism, but requests are still first
> > queued on fiq->pending only to be immediately dequeued and queued
> > onto the virtio queue.
> >
> > The queuing on fiq->pending is unnecessary and might even have some
> > performance impact due to being a contention point.
> >
> > Forget requests are handled similarly.
> >
> > Move the queuing of requests and forgets into the fiq->ops->*.
> > fuse_iqueue_ops are renamed to reflect the new semantics.
> >
> > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> > ---
> >  fs/fuse/dev.c       | 159 ++++++++++++++++++++-----------------------
> >  fs/fuse/fuse_i.h    |  19 ++----
> >  fs/fuse/virtio_fs.c |  41 ++++--------
> >  3 files changed, 106 insertions(+), 113 deletions(-)
>
> This is a little scary, but I can't think of a scenario where directly
> dispatching requests to virtqueues is a problem.
>
> Is there someone who can run single and multiqueue virtiofs
> performance benchmarks?
>
> Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>

I ran some tests and experiments with the patch (on top of v6.10-rc2)
against our multi-queue capable virtio-fs device. No issues were found.

Experimental system setup (neither the fastest possible setup nor the
most optimized one):

# Host:
- Dell PowerEdge R7525
- CPU: 2x AMD EPYC 7413 24-Core
- VM: QEMU KVM with 24 cores, vCPUs pinned to the NUMA nodes to which
  the DPU is attached. VFIO-PCI device to pass through the DPU. Running
  a default x86_64 ext4 buildroot with fio installed.

# Virtio-fs device:
- BlueField-3 DPU
- CPU: ARM Cortex-A78AE, 16 cores
- One thread per queue, each busy polling on one request queue
- Each queue is 1024 descriptors deep

# Workload (deviations are noted in the table):
- fio 3.34
- Sequential read
- ioengine=io_uring, single 4GiB file, iodepth=128, bs=256KiB,
  runtime=30s, ramp_time=10s, direct=1
- T is the number of threads (numjobs=T with thread=1)
- Q is the number of request queues
(An example fio invocation is included at the end of this mail.)

| Workload           | Before patch | After patch |
| ------------------ | ------------ | ----------- |
| T=1  Q=1           | 9216MiB/s    | 9201MiB/s   |
| T=2  Q=2           | 10.8GiB/s    | 10.7GiB/s   |
| T=4  Q=4           | 12.6GiB/s    | 12.2GiB/s   |
| T=8  Q=8           | 19.5GiB/s    | 19.7GiB/s   |
| T=16 Q=1           | 9451MiB/s    | 9558MiB/s   |
| T=16 Q=2           | 13.5GiB/s    | 13.4GiB/s   |
| T=16 Q=4           | 11.8GiB/s    | 11.4GiB/s   |
| T=16 Q=8           | 11.1GiB/s    | 10.8GiB/s   |
| T=24 Q=24          | 26.5GiB/s    | 26.5GiB/s   |
| T=24 Q=24 24 files | 26.5GiB/s    | 26.6GiB/s   |
| T=24 Q=24 4k       | 948MiB/s     | 955MiB/s    |

Averaging out those results, the difference is within a reasonable
margin of error (less than 1%), so in this setup we see no difference
in performance. However, if the virtio-fs device were more optimized,
e.g. if it did not copy the data into its own memory, then the
bottleneck could shift to the driver side and this patch could show a
benefit at those higher message rates.

So although I would have hoped for a measurable performance increase
already with this setup, I still think this is a good patch and a
logical optimization for high-performance virtio-fs devices that may
show a benefit in the future.

Tested-by: Peter-Jan Gootzen <pgootzen@xxxxxxxxxx>
Reviewed-by: Peter-Jan Gootzen <pgootzen@xxxxxxxxxx>
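
For reference, this is roughly what the fio invocation for the T=16 Q=4
row looks like; the job name, mount point and file name below are
placeholders, and the number of request queues (Q) is configured on the
virtio-fs device/driver side rather than through fio:

    # Sequential read, 16 fio threads against a single 4GiB file on the
    # virtio-fs mount; 30s measured after a 10s ramp, O_DIRECT I/O.
    fio --name=seqread --filename=/mnt/virtiofs/testfile --size=4G \
        --rw=read --ioengine=io_uring --direct=1 --bs=256k \
        --iodepth=128 --numjobs=16 --thread --group_reporting \
        --time_based --runtime=30s --ramp_time=10s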