Re: [PATCH] fuse: cleanup request queuing towards virtiofs

On Wed, 2024-05-29 at 14:32 -0400, Stefan Hajnoczi wrote:
> On Wed, May 29, 2024 at 05:52:07PM +0200, Miklos Szeredi wrote:
> > Virtiofs has its own queuing mechanism, but still requests are first
> > queued on fiq->pending to be immediately dequeued and queued onto the
> > virtio queue.
> > 
> > The queuing on fiq->pending is unnecessary and might even have some
> > performance impact due to being a contention point.
> > 
> > Forget requests are handled similarly.
> > 
> > Move the queuing of requests and forgets into the fiq->ops->*.
> > fuse_iqueue_ops are renamed to reflect the new semantics.
> > 
> > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> > ---
> >  fs/fuse/dev.c       | 159 ++++++++++++++++++++++++--------------------
> >  fs/fuse/fuse_i.h    |  19 ++----
> >  fs/fuse/virtio_fs.c |  41 ++++--------
> >  3 files changed, 106 insertions(+), 113 deletions(-)
> 
> This is a little scary but I can't think of a scenario where directly
> dispatching requests to virtqueues is a problem.
> 
> Is there someone who can run single and multiqueue virtiofs performance
> benchmarks?
> 
> Reviewed-by: Stefan Hajnoczi <stefanha@xxxxxxxxxx>

I ran some tests and experiments on the patch (on top of v6.10-rc2) with
our multi-queue capable virtio-fs device. No issues were found.

Experimental system setup (neither the fastest possible nor the most
optimized setup!):
# Host:
   - Dell PowerEdge R7525
   - CPU: 2x AMD EPYC 7413 24-Core
   - VM: QEMU/KVM with 24 cores, vCPUs pinned to the NUMA nodes to which
     the DPU is attached; the DPU is passed through as a VFIO-PCI device.
     The guest runs a default x86_64 ext4 buildroot with fio installed.
# Virtio-fs device:
   - BlueField-3 DPU
   - CPU: ARM Cortex-A78AE, 16 cores
   - One thread per queue, each busy polling on one request queue
   - Each queue is 1024 descriptors deep
# Workload (deviations are specified in the table):
   - fio 3.34
   - sequential read
   - ioengine=io_uring, single 4GiB file, iodepth=128, bs=256KiB,
     runtime=30s, ramp_time=10s, direct=1 (an approximate command line
     is given below this list)
   - T is the number of threads (numjobs=T with thread=1)
   - Q is the number of request queues
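
For reference, an fio invocation matching the parameters above would look
roughly like this (reconstructed for illustration; /mnt/virtiofs and the
file name are placeholders, $T is the per-row thread count, and the exact
flags we used may have differed slightly):

fio --name=seqread --filename=/mnt/virtiofs/testfile --rw=read \
    --ioengine=io_uring --direct=1 --bs=256k --iodepth=128 --size=4g \
    --numjobs=$T --thread --time_based --runtime=30s --ramp_time=10s \
    --group_reporting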

| Workload           | Before patch | After patch |
| ------------------ | ------------ | ----------- |
| T=1 Q=1            | 9216MiB/s    | 9201MiB/s   |
| T=2 Q=2            | 10.8GiB/s    | 10.7GiB/s   |
| T=4 Q=4            | 12.6GiB/s    | 12.2GiB/s   |
| T=8 Q=8            | 19.5GiB/s    | 19.7GiB/s   |
| T=16 Q=1           | 9451MiB/s    | 9558MiB/s   |
| T=16 Q=2           | 13.5GiB/s    | 13.4GiB/s   |
| T=16 Q=4           | 11.8GiB/s    | 11.4GiB/s   |
| T=16 Q=8           | 11.1GiB/s    | 10.8GiB/s   |
| T=24 Q=24          | 26.5GiB/s    | 26.5GiB/s   |
| T=24 Q=24 24 files | 26.5GiB/s    | 26.6GiB/s   |
| T=24 Q=24 4k       | 948MiB/s     | 955MiB/s    |

Averaged over these results, the difference is within a reasonable margin
of error (less than 1%), so in this setup's case we see no difference in
performance.
However, if the virtio-fs device were more optimized, e.g. if it did not
copy the data into its own memory, the bottleneck could move to the
driver side and this patch could show some benefit at those higher
message rates.
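
For context, my reading of the change (a rough sketch only, not the exact
diff; the "before" helper is queue_request_and_unlock() from fs/fuse/dev.c
as of v6.10-rc2, and the "after" callback name follows the renamed
fuse_iqueue_ops described in the patch):

/* Before: every request takes a detour through fiq->pending. */
static void queue_request_and_unlock(struct fuse_iqueue *fiq,
				     struct fuse_req *req)
__releases(fiq->lock)
{
	/*
	 * fiq->lock is held here; the shared pending list is the
	 * contention point the patch removes for virtiofs.
	 */
	list_add_tail(&req->list, &fiq->pending);
	/* The virtiofs wake callback immediately dequeues it again. */
	fiq->ops->wake_pending_and_unlock(fiq);
}

/* After: the op itself owns the queuing. */
static void fuse_send_one(struct fuse_iqueue *fiq, struct fuse_req *req)
{
	/*
	 * /dev/fuse: the default op takes fiq->lock and queues onto
	 * fiq->pending as before.
	 * virtiofs: the op hands the request straight to the virtqueue,
	 * with no intermediate list and no fiq->lock round trip.
	 */
	fiq->ops->send_req(fiq, req);
}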

So although I had hoped to see some performance increase already with
this setup, I still think this is a good patch and a logical optimization
for high-performance virtio-fs devices, one that may show a benefit in
the future.

Tested-by: Peter-Jan Gootzen <pgootzen@xxxxxxxxxx>
Reviewed-by: Peter-Jan Gootzen <pgootzen@xxxxxxxxxx>




