Re: virtio-fs: adding support for multi-queue

On 08/02/2023 11:43, Stefan Hajnoczi wrote:
On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:


On 07/02/2023 22:57, Vivek Goyal wrote:
On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:
On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:
On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:
On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:
Hi,


[cc German]

For my MSc thesis project in collaboration with IBM
(https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
performance of the virtio-fs driver in high-throughput scenarios. We
think the main bottleneck is the fact that the virtio-fs driver does not
support multi-queue (while the spec does). A big factor in this is that
our setup on the virtio-fs device side (a DPU) does not easily allow
multiple cores to tend to a single virtio queue.

This is an interesting limitation in the DPU.

Virtqueues are single-consumer queues anyway. Sharing them between
multiple threads would be expensive. I think using multiqueue is natural
and not specific to DPUs.

Can we create multiple threads (a thread pool) on the DPU and let these
threads process requests in parallel (while there is only one virtqueue)?

This is what we did in virtiofsd. One thread is dedicated to pulling
requests from the virtqueue and then passing each request to a thread
pool to process. And that seems to help with performance in certain
cases.
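
As a rough C sketch of that pattern (one dedicated virtqueue thread
handing requests off to workers; vq_pop() and fuse_handle() are
hypothetical stand-ins, not the actual virtiofsd code):

#include <pthread.h>
#include <stddef.h>

struct fuse_req;                              /* opaque FUSE request */
extern struct fuse_req *vq_pop(void);         /* blocking pop from the virtqueue */
extern void fuse_handle(struct fuse_req *r);  /* process one FUSE request */

static void *worker(void *arg)
{
        fuse_handle(arg);
        return NULL;
}

/* One thread drains the single virtqueue and hands each request off.
 * A real implementation would use a fixed-size pool plus a work queue
 * rather than a thread per request. */
static void *vq_dispatcher(void *arg)
{
        (void)arg;
        for (;;) {
                struct fuse_req *req = vq_pop();
                pthread_t tid;

                if (pthread_create(&tid, NULL, worker, req) == 0)
                        pthread_detach(tid);
                else
                        fuse_handle(req);     /* fall back to inline handling */
        }
        return NULL;
}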

Is that possible on the DPU? That itself can give a nice performance
boost for certain workloads without actually having to implement
multiqueue.

Just curious. I am not opposed to the idea of multiqueue. I am just
curious about the kind of performance gain (if any) it can provide. And
will this be helpful for rust virtiofsd running on the host as well?

Thanks
Vivek

There is technically nothing preventing us from consuming a single queue
on multiple cores; however, our current Virtio implementation (DPU-side)
is set up with the assumption that you should never want to do that
(concurrency mayhem around the Virtqueues and the DMAs). So instead of
putting all the work into reworking the implementation to support that,
and still incurring the big overhead, we see it as more fitting to amend
the virtio-fs driver with multi-queue support.


Is it just a theory at this point in time, or have you implemented it
and seen a significant performance benefit with multiqueue?

It is a theory, but we are currently seeing that with the single request
queue, the single core attending to that queue on the DPU is reasonably
close to being fully saturated.

And will this be helpful for rust virtiofsd running on
host as well?

I figure this would depend on the workload and the user's needs. Having
many cores concurrently pulling on their own virtqueue and then
immediately processing the request locally would of course improve
performance. But we are offloading all this work to the DPU to provide
high-throughput cloud services.

I think Vivek is getting at whether your code processes requests
sequentially or in parallel. A single thread processing the virtqueue
that hands off requests to worker threads or uses io_uring to perform
I/O asynchronously will perform differently from a single thread that
processes requests sequentially in a blocking fashion. Multiqueue is not
necessary for parallelism, but the single queue might become a
bottleneck.
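
To make the distinction concrete, here is a toy liburing sketch of the
asynchronous variant (plain userspace code, not virtiofsd): the
dispatcher queues a read without blocking and reaps the completion
separately.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int fd;

        if (io_uring_queue_init(8, &ring, 0) < 0)
                return 1;
        fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0)
                return 1;

        /* Submit the read; the dispatcher thread is free to keep
         * pulling new requests off the virtqueue meanwhile. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        /* Reap the completion later (here immediately, for brevity). */
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
}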

Requests are handled non-blocking with remote IO on the DPU. Our current
architecture is as follows:

T1: Tends to the Virtq, parses FUSE into remote IO and fires off the
    asynchronous remote IO.
T2: Polls for completion on the remote IO, parses it back into FUSE and
    puts the FUSE buffers in a completion queue of T1.
T1: Handles the Virtio completion and DMA of the requests in the CQ.

Thread 1 is busy polling on its two queues (Virtq and CQ) with equal
priority; thread 2 is busy polling as well. This setup is not really
optimal, but we are working within the constraints of both our DPU and
remote IO stack.
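
The rough shape of the two polling loops, for clarity (every function
here is a hypothetical stand-in, not our actual DPU or remote IO API):

struct req;   /* opaque FUSE request */
struct io;    /* opaque remote IO handle */

extern struct req *virtq_poll(void);               /* new request from guest, or NULL */
extern struct io  *fuse_to_remote(struct req *r);  /* parse FUSE into remote IO */
extern void        remote_io_submit(struct io *o);
extern struct io  *remote_io_poll(void);           /* completed remote IO, or NULL */
extern struct req *remote_to_fuse(struct io *o);   /* parse back into FUSE */
extern void        cq_push(struct req *r);         /* completion queue T2 -> T1 */
extern struct req *cq_pop(void);
extern void        virtio_complete_and_dma(struct req *r);

void thread1(void)
{
        for (;;) {
                struct req *r;

                if ((r = virtq_poll()))      /* step 1: fire off remote IO */
                        remote_io_submit(fuse_to_remote(r));
                if ((r = cq_pop()))          /* step 3: complete towards the guest */
                        virtio_complete_and_dma(r);
        }
}

void thread2(void)
{
        for (;;) {
                struct io *o = remote_io_poll();   /* step 2: reap remote IO */

                if (o)
                        cq_push(remote_to_fuse(o));
        }
}
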
Currently, with a single sequential 4k job, we are able to get the
following throughput:
Write: 246MiB/s
Read: 20MiB/s
We are not sure yet where the bottleneck is for reads; we hope to be able
to match the write speed there. For writes, the two main bottlenecks we
see are the single Virtq (so limited parallelism on the DPU and remote
side) and the fact that virtio-fs IO is constrained to the page size of
4k (NFS, for example, which we are trying to replace, sees huge
performance gains with larger block sizes).

This is what I remembered as well, but I can't find it clearly in the
source right now; do you have references to the source for this?

virtio_blk.ko uses an irq_affinity descriptor to tell virtio_find_vqs()
to spread MSI interrupts across CPUs:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/virtio_blk.c#n609
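
Condensed, the pattern in virtio_blk's init_vq() looks like this
(paraphrased from the linked source; allocation and error handling
omitted):

        struct irq_affinity desc = { 0, };

        /* The irq_affinity descriptor asks the virtio core to spread
         * the per-virtqueue MSI-X vectors across the CPUs. */
        err = virtio_find_vqs(vdev, num_vqs, vqs, callbacks, names, &desc);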

The core blk-mq code has the blk_mq_virtio_map_queues() function to map
block layer queues to virtqueues:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-mq-virtio.c#n24
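
The virtio_blk side of that mapping is a small blk-mq hook (again
condensed from the linked source):

static int virtblk_map_queues(struct blk_mq_tag_set *set)
{
        struct virtio_blk *vblk = set->driver_data;

        /* Map block-layer hw contexts to virtqueues, reusing the
         * interrupt affinity the virtio core computed. */
        return blk_mq_virtio_map_queues(&set->map[HCTX_TYPE_DEFAULT],
                                        vblk->vdev, 0);
}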

virtio_net.ko manually sets virtqueue affinity:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/virtio_net.c#n2283
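
Condensed from virtnet_set_affinity(), the per-queue calls are simply:

        /* Pin each rx/tx virtqueue pair's interrupt to the CPU mask
         * the driver computed for queue i. */
        virtqueue_set_affinity(vi->rq[i].vq, mask);
        virtqueue_set_affinity(vi->sq[i].vq, mask);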

virtio_net.ko tells the core net subsystem about queues using
netif_set_real_num_tx_queues() and then skbs are mapped to queues by
common code:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/dev.c#n4079
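
In virtio_net that boils down to (condensed from the driver):

        /* Tell the core stack how many queue pairs are actually in
         * use; common code then picks a tx queue per skb. */
        netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
        netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);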

Thanks for the pointers. :)

Thanks,
Peter-Jan

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


