On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
> On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> > Hi:
> >
> > Thanks for the work and it looks very reasonable, some questions
> > below.

Yes I am happy to see the per-cpu work resurrected.
Some comments below.

> > On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > > Sorry for being late to submit this patch. I have spent lots of time
> > > trying to find the best approach. This effort is still going on...
> > >
> > > This patch is built against net-next tree.
> > >
> > > This is an experimental RFC patch. The purpose of this patch is to
> > > address KVM networking scalability and NUMA scheduling issues.
> >
> > Also need to test on a non-NUMA machine. I see that you just choose the
> > cpu that initiates the work for a non-NUMA machine, which seems
> > suboptimal.
>
> Good suggestions. I don't have any non-NUMA systems. But KK ran some
> tests on a non-NUMA system. He could see around 20% performance gain for
> single VMs local host to guest. I hope we can run a full test on a
> non-NUMA system.
>
> On a non-NUMA system, the same per-cpu vhost thread will always be picked
> consistently for a particular vq since all cores are on the same cpu
> socket. So there will be two per-cpu vhost threads handling TX/RX
> simultaneously.
>
> > > The existing implementation of vhost creates a vhost thread per-device
> > > (virtio_net) based. RX and TX work of a VM's per-device is handled by
> > > the same vhost thread.
> > >
> > > One of the limitations of this implementation is that with an
> > > increasing number of VMs or number of virtio-net interfaces, more
> > > vhost threads are created; this consumes more kernel resources and
> > > induces more thread context switch/scheduling overhead. We noticed
> > > that the KVM network performance doesn't scale with an increasing
> > > number of VMs.
> > >
> > > The other limitation is having a single vhost thread to process both
> > > RX and TX, so the work will be blocked.
> > > So we create this per-cpu vhost thread implementation. The number of
> > > vhost cpu threads is limited to the number of cpus on the host.
> > >
> > > To address these limitations, we are proposing a per-cpu vhost thread
> > > model where the number of vhost threads is limited and equal to the
> > > number of online cpus on the host.
> >
> > The number of vhost threads needs more consideration. Consider that we
> > have a 1024-core host with a card that has 16 tx/rx queues, do we
> > really need 1024 vhost threads?
>
> In this case, we could add a module parameter to limit the number of
> cores/sockets to be used.

Hmm. And then which cores would we run on?
Also, is the parameter different between guests?
Another idea is to scale the # of threads on demand.

Sharing the same thread between guests is also an
interesting approach. If we did this then per-cpu
won't be so expensive, but making this work well
with cgroups would be a challenge.

> > >
> > > Based on our testing experience, the vcpus can be scheduled across cpu
> > > sockets even when the number of vcpus is smaller than the number of
> > > cores per cpu socket and there is no other activity besides the KVM
> > > networking workload. We found that if the vhost thread is scheduled on
> > > the same socket as the work is received, the performance will be
> > > better.
> > >
> > > So in this per-cpu vhost thread implementation, a vhost thread is
> > > selected dynamically based on where the TX/RX work is initiated. A
> > > vhost thread on the same cpu socket is selected, but not on the same
> > > cpu as the vcpu/interrupt thread that initiated the TX/RX work.
> > >
> > > When we tested this RFC patch, the other interesting thing we found is
> > > that the performance results also seem related to NIC flow steering.
> > > We are spending time on evaluating different NICs' flow director
> > > implementations now. We will enhance this patch based on our findings
> > > later.
> > >
> > > We have tried different scheduling: per-device based, per vq based and
> > > per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost
> > > scheduling. We found that so far the per vq based scheduling is good
> > > enough for now.
> >
> > Could you please explain more about those scheduling strategies? Does
> > per-device based mean letting a dedicated vhost thread handle all work
> > from that vhost device? As you mentioned, maybe an improvement would be
> > for the scheduling to take flow steering info (queue mapping, rxhash
> > etc.) of the skb in the host into account.
>
> Yes, per-device scheduling means one per-cpu vhost thread handles all
> work from one particular vhost-device.
>
> Yes, we think scheduling that takes flow steering info into account
> would help performance. I am studying this now.

Did anything interesting turn up?

> > >
> > > We also tried different algorithms to select which cpu vhost thread
> > > will be running on a specific cpu socket: avg_load balance, and
> > > randomly...
> >
> > It may be worth accounting for out-of-order packets during the test,
> > since for a single stream a different cpu/vhost/physical queue may be
> > chosen to do the packet transmission/reception?
>
> Good point. I haven't gone through all the data yet. netstat output
> might tell us something.
>
> We used an Intel 10G NIC to run all tests. For a single stream test, the
> Intel NIC steers the receive irq to the same irq/queue on which the TX
> packets were sent. So when we mask vcpus from the same VM on one socket,
> we shouldn't hit the packet out-of-order case. We might hit packet out of
> order when vcpus run across sockets.
>
> > >
> > > From our test results, we found that the scalability has been
> > > significantly improved. And this patch is also helpful for small
> > > packet performance.
> > >
> > > However, we are seeing some regressions in a local guest to guest
> > > scenario on an 8 cpu NUMA system.
> > > In one case, a 24 VMs 256 bytes tcp_stream test shows it has improved
> > > from 810Mb/s to 9.1Gb/s. :)
> > > (We created two local VMs, and each VM has 2 vcpus. Without this
> > > patch, the number of threads is 4 vcpus + 2 vhosts = 6; with this
> > > patch it is 4 vcpus + 8 vhosts = 12. It causes more context switches.
> > > When I change the scheduling to use 2-4 vhost threads, the regressions
> > > are gone. I am continuing to investigate how to make local guest to
> > > guest performance better for a small number of VMs. Once I find the
> > > clue, I will share it here.)

So, that's one obvious reason. But there could be other explanations:
1. You explicitly mask out the same CPU. But if the socket is very small
   (it's likely each socket is 2 CPUs or even 1 here), this might limit
   the scheduler drastically.
2. If the guest ends up running on the same socket, you cause more IPIs,
   which cause exits for the other guest.

> > >
> > > The cpu hotplug support isn't in place yet. I will post it later.

Not yet done, right?

> > Another question is why not just use a workqueue? It has full support
> > for cpu hotplug and allows more policies.
>
> Yes, it's good to use a workqueue. I just did everything on top of the
> current implementation so it's easy to compare/analyze the performance
> data.
>
> I remember the vhost implementation changed from workqueue to thread
> for some reason. I couldn't recall the reason.

At the time the implementation didn't perform well with per-cpu threads.
We wanted a single thread so switched to use just that.

> > >
> > > Since we have per-cpu vhost threads, each vhost thread will handle
> > > multiple vqs, so we will be able to reduce/remove vq notification when
> > > the work is heavily loaded in future.
> >
> > Does this issue still exist if event index is used? If vhost does not
> > publish a new used index, the guest would not kick again.
>
> Since the vhost model has been changed to handle multiple VMs' vqs work,
> it's not necessary to enable these VMs' vqs notification (publishing a
> new used index) where these vqs' future work will be processed on the
> same per-cpu vhost thread, as long as the per-cpu vhost thread is still
> running.
>
> > >
> > > Here are my test results for the remote host to guest test: tcp_rrs,
> > > udp_rrs, tcp_stream, where the guest has 2 vcpus and the host has two
> > > cpu sockets, each socket with 4 cores.
> > >
> > > TCP_STREAM     256   512    1K    2K    4K    8K   16K
> > > ------------------------------------------------------
> > > Original
> > > H->Guest      2501  4238  4744  5256  7203  6975  5799
> > > Patch
> > > H->Guest      1676  2290  3149  8026  8439  8283  8216
> > >
> > > Original
> > > Guest->H       744  1773  5675  1397  8207  7296  8117
> > > Patch
> > > Guest->Host   1041  1386  5407  7057  8298  8127  8241
> >
> > Looks like there's some noise in the result, the throughput of
> > "original guest -> Host 2K" looks too low. And something strange is
> > that I see regressions of packet transmission of the guest when testing
> > this patch (Guest to Local Host TCP_STREAM in a NUMA machine).
>
> Yes, since I didn't mask the vcpus on the same socket, it might come
> from packets out of order. I will rerun the test with the vcpus masked
> on the same socket to see any difference.

Did anything interesting turn up?

> You can reference Tom's results. His test is more formal than mine.
>
> > >
> > > 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/s
> > > 65% improved with taskset vcpus on the same socket
> > > 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s
> > > 67% improved with taskset vcpus on the same socket
> > >
> > > Tom has run 1 VM to 24 VMs tests for different work. He will post it
> > > here soon.
> > >
> > > If the host scheduler ensures that the VM's vcpus are not scheduled
> > > to another socket (i.e.
> > > cpu mask the vcpus on the same socket) then the performance will be
> > > better.
> > >
> > > Signed-off-by: Shirley Ma<xma@xxxxxxxxxx>
> > > Signed-off-by: Krishna Kumar<krkumar2@xxxxxxxxxx>
> > > Tested-by: Tom Lendacky<toml@xxxxxxxxxx>
> > > ---
> > >
> > >  drivers/vhost/net.c   |   26 ++-
> > >  drivers/vhost/vhost.c |  289 +++++++++++++++++++++++----------
> > >  drivers/vhost/vhost.h |   16 ++-
> > >  3 files changed, 232 insertions(+), 103 deletions(-)
> > >
> > > Thanks
> > > Shirley
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>

Also a question: how does this interact with zero copy tx?

-- 
MST
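
For readers following the thread, the selection rule in the quoted patch description (choose a vhost thread on the same cpu socket as the cpu that initiated the TX/RX work, but not on that cpu itself) can be modelled with a small userspace sketch. This is not code from the patch. The function name pick_vhost_cpu, the socket_of and load lists, the least-loaded tie-break, and the fallback to the initiating cpu when a socket has only one cpu are all assumptions made for illustration.

```python
from typing import List


def pick_vhost_cpu(initiator: int, socket_of: List[int], load: List[int]) -> int:
    """Toy model: choose which per-cpu vhost worker handles a piece of work.

    initiator  -- cpu on which the TX/RX work was initiated
    socket_of  -- socket_of[cpu] is the socket id of that cpu (hypothetical input)
    load       -- load[cpu] is the pending work on that cpu's worker (hypothetical)
    """
    # Candidates: every cpu on the initiator's socket except the initiator.
    candidates = [
        cpu
        for cpu in range(len(socket_of))
        if socket_of[cpu] == socket_of[initiator] and cpu != initiator
    ]
    if not candidates:
        # Assumption: with a single-cpu socket there is nothing else to pick,
        # so fall back to the initiating cpu.
        return initiator
    # Assumption: least-loaded candidate wins, lowest cpu number breaks ties.
    return min(candidates, key=lambda cpu: (load[cpu], cpu))


# Two sockets with four cores each, as on the test host described above.
socket_of = [0, 0, 0, 0, 1, 1, 1, 1]
load = [5, 2, 7, 2, 0, 0, 0, 0]
print(pick_vhost_cpu(0, socket_of, load))
print(pick_vhost_cpu(1, socket_of, load))
```

With very small sockets the candidate list shrinks to one entry or none, which is the situation raised in point 1 of the reply above about limiting the scheduler.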