On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net. Also
> included are the userspace qemu changes.
>
> 1. This feature was first implemented with a single vhost.
>    Testing showed a 3-8% performance gain for up to 8 netperf
>    sessions (and sometimes 16), but BW dropped with more
>    sessions. However, implementing per-txq vhost improved
>    BW significantly, all the way to 128 sessions.
> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
>    daemons for the 'n' TXQ's, for a total of (n+1) daemons.
>    The (subsequent) RX mq patch changes that to a total of
>    'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
>    improves for UDP.
> 4. Interoperability: many combinations of qemu, host and
>    guest were tested together, but not all.
>
>
> Enabling mq on virtio:
> -----------------------
>
> When the following options are passed to qemu:
>         - smp > 1
>         - vhost=on
>         - mq=on (new option, default: off)
> then #txqueues = #cpus. The #txqueues can be changed with the
> optional 'numtxqs' option, e.g. for an smp=4 guest:
>         vhost=on,mq=on            ->  #txqueues = 4
>         vhost=on,mq=on,numtxqs=8  ->  #txqueues = 8
>         vhost=on,mq=on,numtxqs=2  ->  #txqueues = 2
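>
> For example, a guest could be started along these lines (a sketch
> only; the binary name, netdev id and disk image are illustrative,
> and the new mq/numtxqs options are assumed to sit on the tap
> netdev next to vhost=on):
>
>         qemu-system-x86_64 -smp 4 -m 2048 \
>                 -netdev tap,id=hnet0,vhost=on,mq=on,numtxqs=8 \
>                 -device virtio-net-pci,netdev=hnet0 \
>                 guest-disk.img
>
> With smp=4 and no explicit numtxqs, the same device would come up
> with 4 TX queues.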
>
> Performance (guest -> local host):
> -----------------------------------
>
> System configuration:
>         Host:  8 Intel Xeon CPUs, 8 GB memory
>         Guest: 4 cpus, 2 GB memory
> All testing was done without any tuning, using TCP netperf with
> 64K I/O.
> _______________________________________________________________________
>                          TCP (#numtxqs=2)
> N#   BW1    BW2    (%)      SD1    SD2    (%)      RSD1   RSD2   (%)
> _______________________________________________________________________
> 4    26387  40716  (54.30)  20     28     (40.00)  86     85     (-1.16)
> 8    24356  41843  (71.79)  88     129    (46.59)  372    362    (-2.68)
> 16   23587  40546  (71.89)  375    564    (50.40)  1558   1519   (-2.50)
> 32   22927  39490  (72.24)  1617   2171   (34.26)  6694   5722   (-14.52)
> 48   23067  39238  (70.10)  3931   5170   (31.51)  15823  13552  (-14.35)
> 64   22927  38750  (69.01)  7142   9914   (38.81)  28972  26173  (-9.66)
> 96   22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)

That's a significant hit in TCP SD. Is it caused by the imbalance
between the number of queues for TX and RX? Since you mention the RX
patch is complete, maybe measure with balanced TX/RX queue counts?

> _______________________________________________________________________
>                          UDP (#numtxqs=8)
> N#    BW1    BW2    (%)       SD1    SD2    (%)
> __________________________________________________________
> 4     29836  56761  (90.24)   67     63     (-5.97)
> 8     27666  63767  (130.48)  326    265    (-18.71)
> 16    25452  60665  (138.35)  1396   1269   (-9.09)
> 32    26172  63491  (142.59)  5617   4202   (-25.19)
> 48    26146  64629  (147.18)  12813  9316   (-27.29)
> 64    25575  65448  (155.90)  23063  16346  (-29.12)
> 128   26454  63772  (141.06)  91054  85051  (-6.59)
> __________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs, in mbps), SD and
>               Remote SD for the original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs, in mbps), SD and
>               Remote SD for the new code, e.g. BW2=40716 means
>               the average BW2 per run was 20358 mbps.

What happens with a single netperf session? Host -> guest TCP
performance and small-packet speed are also worth measuring.

> Next steps:
> -----------
>
> 1. The mq RX patch is also complete; I plan to submit it once TX
>    is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>    after statically making the sq's (and similarly for vhost)
>    cache-aligned:
>         struct virtnet_info {
>                 ...
>                 struct send_queue sq[16] ____cacheline_aligned_in_smp;
>                 ...
>         };

At some level, host/guest communication is easy in that we don't
really care which queue is used. I would like to give some thought
(and testing) to how this is going to work with a real NIC card and
packet steering at the backend. Any idea?

> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts
>        CPU0     CPU1     CPU2     CPU3
> 40:    0        0        0        0       PCI-MSI-edge  virtio0-config
> 41:    126955   126912   126505   126940  PCI-MSI-edge  virtio0-input
> 42:    108583   107787   107853   107716  PCI-MSI-edge  virtio0-output.0
> 43:    300278   297653   299378   300554  PCI-MSI-edge  virtio0-output.1
> 44:    372607   374884   371092   372011  PCI-MSI-edge  virtio0-output.2
> 45:    162042   162261   163623   162923  PCI-MSI-edge  virtio0-output.3

Does this mean each interrupt is constantly bouncing between CPUs?
(A pinning sketch follows at the end of this mail.)

> Review/feedback appreciated.
>
> Signed-off-by: Krishna Kumar <krkumar2@xxxxxxxxxx>
> ---
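
On the interrupt counts above: if the virtio0-output interrupts really
are bouncing across all 4 CPUs, it may be worth rerunning with each TX
interrupt pinned to one guest CPU and comparing SD. A minimal sketch,
reusing the (boot-specific) IRQ numbers 42-45 from the table above;
note that smp_affinity takes a hex CPU mask:

        # Pin virtio0-output.0..3 (IRQs 42..45) to CPUs 0..3,
        # one TX queue interrupt per CPU.
        for i in 0 1 2 3; do
                printf '%x' $((1 << i)) > /proc/irq/$((42 + i))/smp_affinity
        done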