On Wed, Jun 05, 2024 at 10:40:44AM +0000, Peter-Jan Gootzen wrote:
> On Wed, 2024-05-29 at 14:32 -0400, Stefan Hajnoczi wrote:
> > On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> > > Virtiofs has its own queuing mechanism, but requests are still first
> > > queued on fiq->pending, only to be immediately dequeued and queued
> > > onto the virtio queue.
> > >
> > > The queuing on fiq->pending is unnecessary and might even have some
> > > performance impact due to being a contention point.
> > >
> > > Forget requests are handled similarly.
> > >
> > > Move the queuing of requests and forgets into the fiq->ops->*.
> > > fuse_iqueue_ops are renamed to reflect the new semantics.
> > >
> > > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> > > ---
> > >  fs/fuse/dev.c       | 159 ++++++++++++++++++++++++-----------------------
> > >  fs/fuse/fuse_i.h    |  19 ++----
> > >  fs/fuse/virtio_fs.c |  41 ++++--------
> > >  3 files changed, 106 insertions(+), 113 deletions(-)
> >
> > This is a little scary, but I can't think of a scenario where directly
> > dispatching requests to virtqueues is a problem.
> >
> > Is there someone who can run single- and multiqueue virtiofs
> > performance benchmarks?
> >
> > Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>
>
> I ran some tests and experiments on the patch (on top of v6.10-rc2) with
> our multi-queue capable virtio-fs device. No issues were found.
>
> Experimental system setup (which is neither the fastest possible setup
> nor the most optimized one!):
> # Host:
> - Dell PowerEdge R7525
> - CPU: 2x AMD EPYC 7413 24-Core
> - VM: QEMU KVM with 24 cores, vCPUs pinned to the NUMA nodes to which
>   the DPU is attached. VFIO-pci device to pass through the DPU.
>   Running a default x86_64 ext4 buildroot with fio installed.
> # Virtio-fs device:
> - BlueField-3 DPU
> - CPU: ARM Cortex-A78AE, 16 cores
> - One thread per queue, each busy polling on one request queue
> - Each queue is 1024 descriptors deep
> # Workload (deviations are specified in the table):
> - fio 3.34
> - sequential read
> - ioengine=io_uring, single 4GiB file, iodepth=128, bs=256KiB,
>   runtime=30s, ramp_time=10s, direct=1
> - T is the number of threads (numjobs=T with thread=1)
> - Q is the number of request queues
>
> | Workload           | Before patch | After patch |
> | ------------------ | ------------ | ----------- |
> | T=1  Q=1           | 9216MiB/s    | 9201MiB/s   |
> | T=2  Q=2           | 10.8GiB/s    | 10.7GiB/s   |
> | T=4  Q=4           | 12.6GiB/s    | 12.2GiB/s   |
> | T=8  Q=8           | 19.5GiB/s    | 19.7GiB/s   |
> | T=16 Q=1           | 9451MiB/s    | 9558MiB/s   |
> | T=16 Q=2           | 13.5GiB/s    | 13.4GiB/s   |
> | T=16 Q=4           | 11.8GiB/s    | 11.4GiB/s   |
> | T=16 Q=8           | 11.1GiB/s    | 10.8GiB/s   |
> | T=24 Q=24          | 26.5GiB/s    | 26.5GiB/s   |
> | T=24 Q=24 24 files | 26.5GiB/s    | 26.6GiB/s   |
> | T=24 Q=24 4k       | 948MiB/s     | 955MiB/s    |
>
> Averaging out those results, the difference is within a reasonable
> margin of error (less than 1%), so in this setup's case we see no
> difference in performance. However, if the virtio-fs device were more
> optimized, e.g. if it did not copy the data to its own memory, the
> bottleneck could shift to the driver side and this patch could show
> some benefit at those higher message rates.
>
> So although I would have hoped for some performance increase already
> with this setup, I still think this is a good patch and a logical
> optimization for high-performance virtio-fs devices that might show a
> benefit in the future.
> Tested-by: Peter-Jan Gootzen <pgootzen@xxxxxxxxxx>
> Reviewed-by: Peter-Jan Gootzen <pgootzen@xxxxxxxxxx>

Thank you!

Stefan
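For readers who want a feel for the dispatch-path change Miklos describes in the quoted patch above, here is a toy userspace sketch of the idea. It is not the actual fs/fuse code: the structure and function names (fake_req, queue_ops, transport_send, dispatch_via_pending, dispatch_direct) are invented for illustration. It only models the difference between a round trip through a locked shared list and handing a request straight to a transport callback.

/*
 * Toy userspace model of the dispatch-path change discussed above
 * (not the kernel code; all names are made up for illustration).
 *
 * "Before": every request is appended to a shared pending list under a
 * lock and then immediately removed again so it can be handed to the
 * transport.  "After": the request is passed straight to the
 * transport's send callback, so the shared list, a contention point,
 * disappears from the hot path.
 */
#include <pthread.h>
#include <stdio.h>

struct fake_req {
    int opcode;
    struct fake_req *next;
};

/* Transport-specific dispatch hook, analogous in spirit to fiq->ops. */
struct queue_ops {
    void (*send_req)(struct fake_req *req);
};

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_req *pending_head;

/* Stand-in for pushing a request onto a virtqueue. */
static void transport_send(struct fake_req *req)
{
    printf("dispatching request, opcode=%d\n", req->opcode);
}

/* Before: stage on the shared list, then immediately dequeue and send. */
static void dispatch_via_pending(struct fake_req *req)
{
    struct fake_req *r;

    pthread_mutex_lock(&pending_lock);
    req->next = pending_head;
    pending_head = req;               /* queued ...                  */
    r = pending_head;
    pending_head = r->next;           /* ... and dequeued right away */
    pthread_mutex_unlock(&pending_lock);

    transport_send(r);
}

/* After: hand the request straight to the transport callback. */
static const struct queue_ops direct_ops = {
    .send_req = transport_send,
};

static void dispatch_direct(const struct queue_ops *ops, struct fake_req *req)
{
    ops->send_req(req);               /* no shared list, no lock */
}

int main(void)
{
    struct fake_req a = { .opcode = 1 };
    struct fake_req b = { .opcode = 2 };

    dispatch_via_pending(&a);         /* old-style round trip       */
    dispatch_direct(&direct_ops, &b); /* new-style direct dispatch  */
    return 0;
}

The point of the rework, as the quoted description says, is that the locked staging step in the first path buys virtio-fs nothing, since the request is pulled off again immediately; the second path is what the patch moves into the (renamed) fiq->ops callbacks.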