Re: virtio-fs: adding support for multi-queue

Vivek Goyal <vgoyal@xxxxxxxxxx> · Tue, 7 Mar 2023 17:26:33 -0500

On Tue, Mar 07, 2023 at 08:43:33PM +0100, Peter-Jan Gootzen wrote:
> On 22-02-2023 15:32, Stefan Hajnoczi wrote:
> > On Wed, Feb 08, 2023 at 05:29:25PM +0100, Peter-Jan Gootzen wrote:
> > > On 08/02/2023 11:43, Stefan Hajnoczi wrote:
> > > > On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:
> > > > > 
> > > > > 
> > > > > On 07/02/2023 22:57, Vivek Goyal wrote:
> > > > > > On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:
> > > > > > > > On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:
> > > > > > > > > On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > 
> > > > > > > > [cc German]
> > > > > > > > 
> > > > > > > > > > For my MSc thesis project in collaboration with IBM
> > > > > > > > > > (https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
> > > > > > > > > > performance of the virtio-fs driver in high throughput scenarios. We think
> > > > > > > > > > the main bottleneck is the fact that the virtio-fs driver does not support
> > > > > > > > > > multi-queue (while the spec does). A big factor in this is that our setup on
> > > > > > > > > > the virtio-fs device-side (a DPU) does not easily allow multiple cores to
> > > > > > > > > > tend to a single virtio queue.
> > > > > > > > 
> > > > > > > > This is an interesting limitation in DPU.
> > > > > > > 
> > > > > > > Virtqueues are single-consumer queues anyway. Sharing them between
> > > > > > > multiple threads would be expensive. I think using multiqueue is natural
> > > > > > > and not specific to DPUs.
> > > > > > 
> > > > > > Can we create multiple threads (a thread pool) on the DPU and let these
> > > > > > threads process requests in parallel (while there is only one virt
> > > > > > queue)?
> > > > > > 
> > > > > > So this is what we had done in virtiofsd. One thread is dedicated to
> > > > > > pull the requests from virt queue and then pass the request to thread
> > > > > > pool to process it. And that seems to help with performance in
> > > > > > certain cases.
> > > > > > 
> > > > > > Is that possible on DPU? That itself can give a nice performance
> > > > > > boost for certain workloads without having to implement multiqueue
> > > > > > actually.
> > > > > > 
> > > > > > Just curious. I am not opposed to the idea of multiqueue. I am
> > > > > > just curious about the kind of performance gain (if any) it can
> > > > > > provide. And will this be helpful for rust virtiofsd running on
> > > > > > host as well?
> > > > > > 
> > > > > > Thanks
> > > > > > Vivek
> > > > > > 
> > > > > There is technically nothing preventing us from consuming a single queue on
> > > > > multiple cores, however our current Virtio implementation (DPU-side) is set
> > > > > up with the assumption that you should never want to do that (concurrency
> > > > > mayhem around the Virtqueues and the DMAs). So instead of putting all the
> > > > > work into reworking the implementation to support that and still incur the
> > > > > big overhead, we see it more fitting to amend the virtio-fs driver with
> > > > > multi-queue support.
> > > > > 
> > > > > 
> > > > > > Is it just a theory at this point in time, or have you implemented
> > > > > > it and are you seeing a significant performance benefit with multiqueue?
> > > > > 
> > > > > It is a theory, but we are currently seeing that using the single request
> > > > > queue, the single core attending to that queue on the DPU is reasonably
> > > > > close to being fully saturated.
> > > > > 
> > > > > > And will this be helpful for rust virtiofsd running on
> > > > > > host as well?
> > > > > 
> > > > > I figure this would depend on the workload and the users' needs.
> > > > > Having many cores concurrently pulling on their own virtq and then
> > > > > immediately processing the request locally would of course improve performance.
> > > > > But we are offloading all this work to the DPU, for providing
> > > > > high-throughput cloud services.
> > > > 
> > > > I think Vivek is getting at whether your code processes requests
> > > > sequentially or in parallel. A single thread processing the virtqueue
> > > > that hands off requests to worker threads or uses io_uring to perform
> > > > I/O asynchronously will perform differently from a single thread that
> > > > processes requests sequentially in a blocking fashion. Multiqueue is not
> > > > necessary for parallelism, but the single queue might become a
> > > > bottleneck.
> > > 
> > > Requests are handled non-blocking with remote IO on the DPU. Our current
> > > architecture is as follows:
> > > T1: Tends to the Virtq, parses FUSE to remote IO and fires off the
> > > asynchronous remote IO.
> > > T2: Polls for completion on the remote IO and parses it back to FUSE, puts
> > > the FUSE buffers in a completion queue of T1.
> > > T1: Handles the Virtio completion and DMA of the requests in the CQ.
> > > 
> > > Thread 1 is busy polling on its two queues (Virtq and CQ) with equal
> > > priority, thread 2 is busy polling as well. This setup is not really
> > > optimal, but we are working within the constraints of both our DPU and
> > > remote IO stack.
> > 
> > Why does T1 need to handle VIRTIO completion and DMA requests instead of
> > T2?
> > 
> > Stefan
> 
> No good reason other than the fact that the concurrency safety of our DPU's
> virtio-fs library requires this.
> 
> > I had been doing some performance benchmarking for virtio-fs and I found
> > some old results.
> >
> > https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-10-2021
> >
> > While running on top of local fs, with bs=4K, with single queue I could
> > achieve more than 600MB/s.
> >
> > NAME                    WORKLOAD                Bandwidth       IOPS
> > default                 seqread-psync           625.0mb         156.2k
> > no-tpool                seqread-psync           660.8mb         165.2k
> >
> > But the catch here, I think, is that the host is doing the caching. In your
> > case I am assuming there is no caching at the DPU and all the I/O is
> > going to remote storage (which might be doing caching in memory).
> >
> > Anyway, the point I am trying to make is that even with a single vq, virtiofs
> > can push a reasonable amount of I/O.
> >
> > I will be curious to find out how multiqueue can improve these numbers
> > further.
> 
> We are currently seeing the following throughput numbers:
> https://github.com/IBM/dpu-virtio-fs/blob/d0e0560546e2da86b0022a69abe02ab6ac4a6541/experiments/results/graphs/nulldev_tp.pdf
> This is using a null device implementation in the DPU (immediately return
> reads and writes in the FUSE file system). All using a single vq and one DPU
> thread attending to it. On the host this experiment is using two fio threads
> pinned to the DPU's NUMA node. We see no additional throughput when using
> more than two threads.

As per this chart, you are getting around 1GB/s with a 4K block size. So that's
roughly 256K IOPS with a single queue. Not too bad I would say.
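
For reference, the 256K figure is just throughput divided by block size.
A minimal sketch of that arithmetic, assuming binary units (1GB/s taken as
1 GiB/s, 4K as 4 KiB, and "K" as 1024); the helper name iops() is only
for illustration and is not part of any tool:

```python
# Rough IOPS estimate from throughput and block size.
# Assumption: binary units (GiB, KiB), and "K" meaning 1024.

GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024


def iops(throughput_bytes_per_s: float, block_size_bytes: int) -> float:
    """Return I/O operations per second for a given throughput and block size."""
    return throughput_bytes_per_s / block_size_bytes


# ~1 GiB/s at 4 KiB per request, as read off the chart.
rate = iops(1 * GIB, 4 * KIB)
print(f"{rate:.0f} IOPS ({rate / 1024:.0f}K)")  # prints "262144 IOPS (256K)"
```

Under the same convention, the older seqread-psync numbers quoted above
(625.0mb and 156.2k) are consistent with each other: 625 MiB/s at 4 KiB
comes out at 160000 requests per second, i.e. about 156.25K.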

Would be interesting to see how multiqueue support impacts that number.

Thanks
Vivek
