On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 09/19/2010 06:14:43 PM:
>
> > Could you document how exactly do you measure multistream bandwidth:
> > netperf flags, etc?
>
> All results were without any netperf flags or system tuning:
>
>     for i in $list
>     do
>         netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
>     done
>     wait
>
> Another script processes the result files. It also displays the
> start time/end time of each iteration to make sure skew due to
> parallel netperfs is minimal.
>
> I changed the vhost functionality once more to try to get the
> best model, the new model being:
>
> 1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
> 2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
>    TX[0-n], where MAX is 4. Beyond numtxqs=4, the remaining TX
>    queues are handled by the vhost threads in round-robin fashion.
>
> Results from here on are with these changes, and the only "tuning" is
> to set each vhost's affinity to CPUs[0-3] ("taskset -p f <vhost-pids>").
>
> > Any idea where does this come from?
> > Do you see more TX interrupts? RX interrupts? Exits?
> > Do interrupts bounce more between guest CPUs?
> > 4. Identify reasons for single netperf BW regression.
>
> After testing various combinations of #txqs, #vhosts and #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.

Right. Can we fix it?

> I did two more tests:
>
> 1. Pin vhosts to the same CPU:
>    - BW drop is much lower for the 1 stream case (-5 to -8% range)
>    - But performance is not as high for more sessions.
> 2. Changed vhost to be single threaded:
>    - No degradation for 1 session, and improvement for up to
>      8, sometimes 16 streams (5-12%).
>    - BW degrades after that, all the way up to 128 netperf sessions.
>    - But overall CPU utilization improves.
>
>    Summary of the entire run (for 1-128 sessions):
>        txq=4:  BW: (-2.3)   CPU: (-16.5)   RCPU: (-5.3)
>        txq=16: BW: (-1.9)   CPU: (-24.9)   RCPU: (-9.6)
>
> I don't see any of the reasons mentioned above. However, for a higher
> number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

> _______________________________________
> #netperf   ORG            NEW
>            BW (#retr)     BW (#retr)
> _______________________________________
> 1          70244 (0)      64102 (0)
> 4          21421 (0)      36570 (416)
> 8          21746 (0)      38604 (148)
> 16         21783 (0)      40632 (464)
> 32         22677 (0)      37163 (1053)
> 64         23648 (4)      36449 (2197)
> 128        23251 (2)      31676 (3185)
> _______________________________________
>
> The single netperf case didn't have any retransmissions, so that is not
> the cause of the drop. I tested ixgbe (MQ):
>
> ___________________________________________________________
> #netperf   ixgbe           ixgbe (pin intrs to cpu#0 on
>                            both server/client)
>            BW (#retr)      BW (#retr)
> ___________________________________________________________
> 1          3567 (117)      6000 (251)
> 2          4406 (477)      6298 (725)
> 4          6119 (1085)     7208 (3387)
> 8          6595 (4276)     7381 (15296)
> 16         6651 (11651)    6856 (30394)
> ___________________________________________________________

Interesting. You are saying we get many more retransmissions
with the physical nic as well?
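On the errors vs. retransmissions question: it might be worth
cross-checking the per-run #retr numbers against the host-wide TCP
counters. A rough sketch of such a check - an assumption on my part,
not necessarily how the numbers above were collected - reusing the
address from the script above:

    # snapshot the TCP retransmit counter, run one netperf, diff the counter
    before=$(netstat -s | awk '/segments retrans/ {print $1}')
    netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.check
    after=$(netstat -s | awk '/segments retrans/ {print $1}')
    echo "TCP segments retransmitted during the run: $((after - before))"

If netstat also shows errors (bad segments, resets) growing with the
number of sessions, that would be worth separating from plain
loss-driven retransmits.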
> > 5. Test perf in more scenarios:
> >    small packets
>
> 512 byte packets - BW drops for up to 8 (sometimes 16) netperf sessions,
> but increases with #sessions:
>
> _______________________________________________________________________________
> #     BW1      BW2 (%)          CPU1    CPU2 (%)       RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> 1     4043     3800 (-6.0)      50      50 (0)         86      98 (13.9)
> 2     8358     7485 (-10.4)     153     178 (16.3)     230     264 (14.7)
> 4     20664    13567 (-34.3)    448     490 (9.3)      530     624 (17.7)
> 8     25198    17590 (-30.1)    967     1021 (5.5)     1085    1257 (15.8)
> 16    23791    24057 (1.1)      1904    2220 (16.5)    2156    2578 (19.5)
> 24    23055    26378 (14.4)     2807    3378 (20.3)    3225    3901 (20.9)
> 32    22873    27116 (18.5)     3748    4525 (20.7)    4307    5239 (21.6)
> 40    22876    29106 (27.2)     4705    5717 (21.5)    5388    6591 (22.3)
> 48    23099    31352 (35.7)     5642    6986 (23.8)    6475    8085 (24.8)
> 64    22645    30563 (34.9)     7527    9027 (19.9)    8619    10656 (23.6)
> 80    22497    31922 (41.8)     9375    11390 (21.4)   10736   13485 (25.6)
> 96    22509    32718 (45.3)     11271   13710 (21.6)   12927   16269 (25.8)
> 128   22255    32397 (45.5)     15036   18093 (20.3)   17144   21608 (26.0)
> _______________________________________________________________________________
> SUM:           BW: (16.7)               CPU: (20.6)            RCPU: (24.3)
> _______________________________________________________________________________
>
> > host -> guest
>
> _______________________________________________________________________________
> #     BW1      BW2 (%)          CPU1    CPU2 (%)       RCPU1   RCPU2 (%)
> _______________________________________________________________________________
> *1    70706    90398 (27.8)     300     327 (9.0)      140     175 (25.0)
> 2     20951    21937 (4.7)      188     196 (4.2)      93      103 (10.7)
> 4     19952    25281 (26.7)     397     496 (24.9)     210     304 (44.7)
> 8     18559    24992 (34.6)     802     1010 (25.9)    439     659 (50.1)
> 16    18882    25608 (35.6)     1642    2082 (26.7)    953     1454 (52.5)
> 24    19012    26955 (41.7)     2465    3153 (27.9)    1452    2254 (55.2)
> 32    19846    26894 (35.5)     3278    4238 (29.2)    1914    3081 (60.9)
> 40    19704    27034 (37.2)     4104    5303 (29.2)    2409    3866 (60.4)
> 48    19721    26832 (36.0)     4924    6418 (30.3)    2898    4701 (62.2)
> 64    19650    26849 (36.6)     6595    8611 (30.5)    3975    6433 (61.8)
> 80    19432    26823 (38.0)     8244    10817 (31.2)   4985    8165 (63.7)
> 96    20347    27886 (37.0)     9913    13017 (31.3)   5982    9860 (64.8)
> 128   19108    27715 (45.0)     13254   17546 (32.3)   8153    13589 (66.6)
> _______________________________________________________________________________
> SUM:           BW: (32.4)               CPU: (30.4)            RCPU: (62.6)
> _______________________________________________________________________________
> *: Sum over 7 iterations; the remaining test cases are sums over 2 iterations
>
> > guest <-> external
>
> I haven't done this right now since I don't have a setup. I guess
> it would be limited by wire speed and gains may not be there. I
> will try to do this later when I get the setup.

OK, but at least we need to check that it does not hurt things.

> > in last case:
> >     find some other way to measure host CPU utilization,
> >     try multiqueue and single queue devices
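On measuring host CPU utilization another way: one option that does
not depend on netperf's -c/-C accounting is to sample the host with
sar from the sysstat package for the duration of a run. A rough
sketch, assuming sysstat is installed (the 60 seconds matches the
netperf run length used above):

    # sample overall host CPU once per second for the 60-second run,
    # then report average utilization as 100 - average %idle
    sar -u 1 60 | awk '/Average:/ {print 100 - $NF}'

This only gives a host-wide sanity check; it does not split the time
between the vhost threads, qemu and everything else (per-thread
numbers would need something like pidstat or /proc/<pid>/stat
sampling).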
> > 6. Use above to figure out what is a sane default for numtxqs
>
> A. Summary for default I/O (16K):
>     #txqs=2 (#vhost=3):   BW: (37.6)   CPU: (69.2)   RCPU: (40.8)
>     #txqs=4 (#vhost=5):   BW: (36.9)   CPU: (60.9)   RCPU: (25.2)
>     #txqs=8 (#vhost=5):   BW: (41.8)   CPU: (50.0)   RCPU: (15.2)
>     #txqs=16 (#vhost=5):  BW: (40.4)   CPU: (49.9)   RCPU: (10.0)
>
> B. Summary for 512 byte I/O:
>     #txqs=2 (#vhost=3):   BW: (31.6)   CPU: (35.7)   RCPU: (28.6)
>     #txqs=4 (#vhost=5):   BW: (5.7)    CPU: (27.2)   RCPU: (22.7)
>     #txqs=8 (#vhost=5):   BW: (-0.6)   CPU: (25.1)   RCPU: (22.5)
>     #txqs=16 (#vhost=5):  BW: (-6.6)   CPU: (24.7)   RCPU: (21.7)
>
> Summary:
>
> 1. The average BW increase for regular I/O is best for #txq=16, with the
>    smallest increase in CPU utilization.
> 2. The average BW for 512 byte I/O is best for the lower #txq=2. For higher
>    #txqs, BW increased only beyond a particular number of netperf sessions -
>    in my testing that limit was 32 netperf sessions.
> 3. Multiple txqs in the guest by itself doesn't seem to cause any issues.
>    The guest CPU% increase is slightly higher than the BW improvement. I
>    think this is true for all mq drivers, since more paths run in parallel
>    up to the device instead of sleeping and allowing one thread to send
>    all packets via qdisc_restart.
> 4. Having a high number of txqs gives better gains and reduces CPU
>    utilization on both the guest and the host.
> 5. MQ is intended for server loads. MQ should probably not be explicitly
>    specified for client systems.
> 6. No regression with numtxqs=1 (or if the mq option is not used) in any
>    testing scenario.

Of course txq=1 can be considered a kind of fix, but if we know the
issue is TX/RX flows getting bounced between CPUs, can we fix this?
Workload-specific optimizations can only get us this far.

>
> I will send the v3 patch within a day after some more testing.
>
> Thanks,
>
> - KK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html