On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net. Also
> included are the userspace qemu changes.
>
> 1. This feature was first implemented with a single vhost.
>    Testing showed a 3-8% performance gain for up to 8 netperf
>    sessions (and sometimes 16), but BW dropped with more
>    sessions. However, implementing per-txq vhost improved
>    BW significantly, all the way to 128 sessions.
> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
>    daemons for the 'n' TXQ's, for a total of (n+1) daemons.
>    The (subsequent) RX mq patch changes that to a total of
>    'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
>    improves for UDP.
> 4. Interoperability: many combinations of qemu, host and
>    guest were tested together, but not all.
>
>
> Enabling mq on virtio:
> -----------------------
>
> When the following options are passed to qemu:
>         - smp > 1
>         - vhost=on
>         - mq=on (new option, default: off)
> then #txqueues = #cpus. The #txqueues can be changed with the
> optional 'numtxqs' option, e.g. for an smp=4 guest:
>         vhost=on,mq=on            ->  #txqueues = 4
>         vhost=on,mq=on,numtxqs=8  ->  #txqueues = 8
>         vhost=on,mq=on,numtxqs=2  ->  #txqueues = 2
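>
> For example, a guest could be started along these lines (a sketch
> only; the binary name, netdev id and disk image are illustrative,
> and the new mq/numtxqs options are assumed to sit on the tap
> netdev next to vhost=on):
>
>         qemu-system-x86_64 -smp 4 -m 2048 \
>                 -netdev tap,id=hnet0,vhost=on,mq=on,numtxqs=8 \
>                 -device virtio-net-pci,netdev=hnet0 \
>                 guest-disk.img
>
> With smp=4 and no explicit numtxqs, the same device would come up
> with 4 TX queues.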
>
> Performance (guest -> local host):
> -----------------------------------
>
> System configuration:
>         Host:  8 Intel Xeon CPUs, 8 GB memory
>         Guest: 4 cpus, 2 GB memory
> All testing was done without any tuning, using TCP netperf with
> 64K I/O.
> _______________________________________________________________________
>                          TCP (#numtxqs=2)
> N#   BW1    BW2    (%)      SD1    SD2    (%)      RSD1   RSD2   (%)
> _______________________________________________________________________
> 4    26387  40716  (54.30)  20     28     (40.00)  86     85     (-1.16)
> 8    24356  41843  (71.79)  88     129    (46.59)  372    362    (-2.68)
> 16   23587  40546  (71.89)  375    564    (50.40)  1558   1519   (-2.50)
> 32   22927  39490  (72.24)  1617   2171   (34.26)  6694   5722   (-14.52)
> 48   23067  39238  (70.10)  3931   5170   (31.51)  15823  13552  (-14.35)
> 64   22927  38750  (69.01)  7142   9914   (38.81)  28972  26173  (-9.66)
> 96   22568  38520  (70.68)  16258  27844  (71.26)  65944  73031  (10.74)

That's a significant hit in TCP SD. Is it caused by the imbalance
between the number of queues for TX and RX? Since you mention the RX
patch is complete, maybe measure with balanced TX/RX queue counts?

> _______________________________________________________________________
>                          UDP (#numtxqs=8)
> N#    BW1    BW2    (%)       SD1    SD2    (%)
> __________________________________________________________
> 4     29836  56761  (90.24)   67     63     (-5.97)
> 8     27666  63767  (130.48)  326    265    (-18.71)
> 16    25452  60665  (138.35)  1396   1269   (-9.09)
> 32    26172  63491  (142.59)  5617   4202   (-25.19)
> 48    26146  64629  (147.18)  12813  9316   (-27.29)
> 64    25575  65448  (155.90)  23063  16346  (-29.12)
> 128   26454  63772  (141.06)  91054  85051  (-6.59)
> __________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs, in mbps), SD and
>               Remote SD for the original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs, in mbps), SD and
>               Remote SD for the new code, e.g. BW2=40716 means
>               the average BW2 per run was 20358 mbps.

What happens with a single netperf session? Host -> guest TCP
performance and small-packet speed are also worth measuring.

> Next steps:
> -----------
>
> 1. The mq RX patch is also complete; I plan to submit it once TX
>    is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>    after statically making the sq's (and similarly for vhost)
>    cache-aligned:
>         struct virtnet_info {
>                 ...
>                 struct send_queue sq[16] ____cacheline_aligned_in_smp;
>                 ...
>         };

At some level, host/guest communication is easy in that we don't
really care which queue is used. I would like to give some thought
(and testing) to how this is going to work with a real NIC card and
packet steering at the backend. Any idea?

> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts
>        CPU0     CPU1     CPU2     CPU3
> 40:    0        0        0        0       PCI-MSI-edge  virtio0-config
> 41:    126955   126912   126505   126940  PCI-MSI-edge  virtio0-input
> 42:    108583   107787   107853   107716  PCI-MSI-edge  virtio0-output.0
> 43:    300278   297653   299378   300554  PCI-MSI-edge  virtio0-output.1
> 44:    372607   374884   371092   372011  PCI-MSI-edge  virtio0-output.2
> 45:    162042   162261   163623   162923  PCI-MSI-edge  virtio0-output.3

Does this mean each interrupt is constantly bouncing between CPUs?
(A pinning sketch follows at the end of this mail.)

> Review/feedback appreciated.
>
> Signed-off-by: Krishna Kumar <krkumar2@xxxxxxxxxx>
> ---
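
On the interrupt counts above: if the virtio0-output interrupts really
are bouncing across all 4 CPUs, it may be worth rerunning with each TX
interrupt pinned to one guest CPU and comparing SD. A minimal sketch,
reusing the (boot-specific) IRQ numbers 42-45 from the table above;
note that smp_affinity takes a hex CPU mask:

        # Pin virtio0-output.0..3 (IRQs 42..45) to CPUs 0..3,
        # one TX queue interrupt per CPU.
        for i in 0 1 2 3; do
                printf '%x' $((1 << i)) > /proc/irq/$((42 + i))/smp_affinity
        done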