On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
> > On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> > > Hi:
> > >
> > > Thanks for the work and it looks very reasonable, some questions
> > > below.
>
> Yes, I am happy to see the per-cpu work resurrected.
> Some comments below.

Glad to see you have time to review this.

> > > On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > > > Sorry for being late to submit this patch. I have spent lots of
> > > > time trying to find the best approach. This effort is still
> > > > going on...
> > > >
> > > > This patch is built against the net-next tree.
> > > >
> > > > This is an experimental RFC patch. The purpose of this patch is
> > > > to address KVM networking scalability and the NUMA scheduling
> > > > issue.
> > >
> > > This also needs testing on a non-NUMA machine. I see that you
> > > just choose the cpu that initiates the work on a non-numa
> > > machine, which seems sub-optimal.
> >
> > Good suggestion. I don't have any non-numa systems, but KK ran some
> > tests on a non-numa system. He saw around a 20% performance gain
> > for a single VM, local host to guest. I hope we can run a full test
> > on a non-numa system.
> >
> > On a non-numa system, the same per-cpu vhost thread will always be
> > picked consistently for a particular vq, since all cores are on the
> > same cpu socket. So there will be two per-cpu vhost threads
> > handling TX/RX simultaneously.
> >
> > > > The existing vhost implementation creates one vhost thread per
> > > > device (virtio_net). RX and TX work of a VM's device is handled
> > > > by the same vhost thread.
> > > >
> > > > One limitation of this implementation is that with an
> > > > increasing number of VMs or virtio-net interfaces, more vhost
> > > > threads are created; they consume more kernel resources and
> > > > induce more thread context-switch/scheduling overhead. We
> > > > noticed that KVM network performance doesn't scale with an
> > > > increasing number of VMs.
> > > >
> > > > The other limitation is that with a single vhost thread
> > > > processing both RX and TX, the work can be blocked. So we
> > > > created this per-cpu vhost thread implementation, where the
> > > > number of vhost cpu threads is limited to the number of cpus on
> > > > the host.
> > > >
> > > > To address these limitations, we are proposing a per-cpu vhost
> > > > thread model where the number of vhost threads is limited and
> > > > equal to the number of online cpus on the host.
> > >
> > > The number of vhost threads needs more consideration. Consider a
> > > 1024-core host with a card that has 16 tx/rx queues: do we really
> > > need 1024 vhost threads?
> >
> > In this case, we could add a module parameter to limit the number
> > of cores/sockets to be used.
>
> Hmm. And then which cores would we run on?
> Also, is the parameter different between guests?

Another idea is to scale the number of threads on demand. If we are
able to pass the number of guests/vcpus to vhost, we can scale the
vhost threads. Any API to get this info?

> Sharing the same thread between guests is also an
> interesting approach, if we did this then per-cpu
> won't be so expensive but making this work well
> with cgroups would be a challenge.

Yes, I am now comparing a vhost thread pool shared among guests with
the per-cpu vhost approach. It's a challenge to work with cgroups
either way.
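Going back to the module parameter idea above, a minimal sketch of
capping the number of per-cpu vhost workers could look like the
following. This is only an illustration of the idea, not code from the
posted patch; the parameter name vhost_max_workers and the helper are
made up.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/cpumask.h>

/* 0 (default) means one worker per online cpu. */
static unsigned int vhost_max_workers;
module_param(vhost_max_workers, uint, 0444);
MODULE_PARM_DESC(vhost_max_workers,
		 "Cap on per-cpu vhost worker threads (0 = num_online_cpus)");

/* How many per-cpu vhost workers to create at module init time. */
static unsigned int vhost_nr_workers(void)
{
	unsigned int n = num_online_cpus();

	/* Never create more workers than the configured cap. */
	if (vhost_max_workers)
		n = min(n, vhost_max_workers);
	return n;
}

Whether the cap should be per-host or per-guest (Michael's question)
is left open here; a single global knob like this is the simplest
possible variant.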
> > > > Based on our testing experience, the vcpus can be scheduled
> > > > across cpu sockets even when the number of vcpus is smaller
> > > > than the number of cores per cpu socket and there is no other
> > > > activity besides the KVM networking workload. We found that if
> > > > the vhost thread is scheduled on the same socket where the work
> > > > is received, the performance is better.
> > > >
> > > > So in this per-cpu vhost thread implementation, a vhost thread
> > > > is selected dynamically based on where the TX/RX work is
> > > > initiated. A vhost thread on the same cpu socket is selected,
> > > > but not on the same cpu as the vcpu/interrupt thread that
> > > > initiated the TX/RX work.
> > > >
> > > > While testing this RFC patch, the other interesting thing we
> > > > found is that the performance results also seem related to NIC
> > > > flow steering. We are spending time evaluating different NICs'
> > > > flow director implementations now. We will enhance this patch
> > > > based on our findings later.
> > > >
> > > > We have tried different scheduling: per-device based, per-vq
> > > > based and per work type (tx_kick, rx_kick, tx_net, rx_net)
> > > > based vhost scheduling. We found that so far the per-vq based
> > > > scheduling is good enough.
> > >
> > > Could you please explain more about those scheduling strategies?
> > > Does per-device based mean letting a dedicated vhost thread
> > > handle all work from that vhost device? As you mentioned, an
> > > improvement might be to have the scheduling take the flow
> > > steering info (queue mapping, rxhash etc.) of the skb in the host
> > > into account.
> >
> > Yes, per-device scheduling means one per-cpu vhost thread handles
> > all work from one particular vhost device.
> >
> > Yes, we think scheduling that takes flow steering info into account
> > would help performance. I am studying this now.
>
> Did anything interesting turn up?

Not yet, still investigating.

> > > > We also tried different algorithms to select which per-cpu
> > > > vhost thread will run on a specific cpu socket: avg_load
> > > > balancing, random selection...
> > >
> > > It may be worth accounting for out-of-order packets during the
> > > test, as for a single stream a different cpu/vhost/physical queue
> > > may be chosen to do the packet transmission/reception.
> >
> > Good point. I haven't gone through all the data yet. netstat output
> > might tell us something.
> >
> > We used an Intel 10G NIC to run all tests. For a single-stream
> > test, the Intel NIC steers receive irqs to the same irq/queue on
> > which the TX packets were sent. So when we mask a VM's vcpus onto
> > one socket, we shouldn't hit the packet out-of-order case. We might
> > hit packets out of order when vcpus run across sockets.
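For reference, the "same socket but not the same cpu" selection
described in the quoted patch description could be sketched roughly as
below. This is illustrative only, with a made-up function name; the
actual patch may pick the cpu differently (by load average or at
random, as mentioned above).

#include <linux/cpumask.h>
#include <linux/topology.h>
#include <linux/smp.h>

/*
 * Pick a cpu for the vhost worker: prefer a cpu on the same NUMA node
 * as the cpu that initiated the TX/RX work, but not that cpu itself,
 * so the vcpu/interrupt thread and the vhost thread don't compete for
 * one core. Caller is assumed to run with preemption disabled (e.g.
 * the vq kick path).
 */
static int vhost_pick_worker_cpu(void)
{
	int this_cpu = smp_processor_id();
	int cpu;

	/* Walk the cpus on the local node, skipping the initiating cpu. */
	for_each_cpu(cpu, cpumask_of_node(cpu_to_node(this_cpu))) {
		if (cpu == this_cpu || !cpu_online(cpu))
			continue;
		return cpu;
	}

	/* Single-cpu node or non-NUMA fallback: stay on the local cpu. */
	return this_cpu;
}

A real implementation would also want the choice to be stable for a
given vq, so a single flow doesn't bounce between workers, which is
one source of the out-of-order concern discussed above.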
> > > > From our test results, we found that the scalability has been
> > > > significantly improved. And this patch is also helpful for
> > > > small-packet performance.
> > > >
> > > > However, we are seeing some regressions in a local
> > > > guest-to-guest scenario on an 8-cpu NUMA system.
> > > > In one case, a 24-VM 256-byte tcp_stream test shows it has
> > > > improved from 810Mb/s to 9.1Gb/s. :)
> > > > (We created two local VMs, and each VM has 2 vcpus. Without
> > > > this patch the number of threads is 4 vcpus + 2 vhosts = 6;
> > > > with this patch it is 4 vcpus + 8 vhosts = 12. It causes more
> > > > context switches. When I change the scheduling to use 2-4 vhost
> > > > threads, the regressions are gone. I am continuing to
> > > > investigate how to make small numbers of VMs, local guest to
> > > > guest, perform better. Once I find the clue, I will share it
> > > > here.)
>
> So, that's one obvious reason. But there could be other explanations:
> 1. You explicitly mask out the same CPU. But if the socket
> is very small (it's likely each socket is 2 CPUs or even 1 here),
> this might limit the scheduler drastically.

Only if we limit guest vcpus to the same socket. By default the host
schedules vcpus across sockets.

> 2. If guest ends up running on the same socket, you cause
> more IPIs which cause exits for the other guest.

I used different approaches to schedule the vhost thread: 1. check the
load average on a particular cpu; 2. randomly pick a cpu. The
performance didn't differ much with a small number of VMs. On Tom's
1-24 VM scalability test, it had impressive results as the number of
VMs increased, compared to the existing approach. So it might not be a
big issue.

> > > > The cpu hotplug support isn't in place yet. I will post it
> > > > later.
>
> Not yet done, right?

Done now, under testing.

> > > Another question: why not just use a workqueue? It has full
> > > support for cpu hotplug and allows more policies.
> >
> > Yes, it would be good to use a workqueue. I just did everything on
> > top of the current implementation so it's easy to compare/analyze
> > the performance data.
> >
> > I remember the vhost implementation changed from workqueue to
> > thread for some reason. I couldn't recall the reason.
>
> At the time the implementation didn't perform well with per-cpu
> threads. We wanted a single thread so we switched to use just that.
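Just to illustrate the workqueue alternative Jason raised, the per-vq
work could be queued on a chosen cpu with queue_work_on() and the
workqueue core would then take care of cpu hotplug. Again a rough
sketch with assumed names, not code from the patch:

#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *vhost_wq;

struct vhost_vq_work {
	struct work_struct work;	/* runs the vq's TX or RX handler */
	/* per-vq state would hang off here */
};

static int __init vhost_wq_init(void)
{
	/* One workqueue; workers are managed per-cpu by the wq core. */
	vhost_wq = alloc_workqueue("vhost", WQ_MEM_RECLAIM, 0);
	return vhost_wq ? 0 : -ENOMEM;
}

/* Queue the vq work on the cpu picked by the NUMA-aware selector. */
static void vhost_vq_kick_work(struct vhost_vq_work *vw, int cpu)
{
	queue_work_on(cpu, vhost_wq, &vw->work);
}

One practical caveat, and as far as I recall part of why vhost uses
its own kthreads today, is that a dedicated kthread can be attached to
the guest owner's cgroups, which is harder with shared workqueue
workers; that is the cgroups challenge mentioned earlier.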
> > > > Since we have per-cpu vhost threads, each vhost thread will
> > > > handle multiple vqs, so we will be able to reduce/remove vq
> > > > notifications when the work is heavily loaded in the future.
> > >
> > > Does this issue still exist if event index is used? If vhost does
> > > not publish a new used index, the guest would not kick again.
> >
> > Since the vhost model has been changed to handle multiple VMs' vq
> > work, it's not necessary to enable those VMs' vq notifications
> > (publish a new used index) when those vqs' future work will be
> > processed on the same per-cpu vhost thread, as long as the per-cpu
> > vhost thread is still running.
> >
> > > > Here are my test results for remote host-to-guest tests:
> > > > TCP_RRs, UDP_RRs, TCP_STREAM. The guest has 2 vcpus; the host
> > > > has two cpu sockets, each socket with 4 cores.
> > > >
> > > > TCP_STREAM (Mb/s)    256   512    1K    2K    4K    8K   16K
> > > > -------------------------------------------------------------
> > > > Original
> > > > Host->Guest         2501  4238  4744  5256  7203  6975  5799
> > > > Patch
> > > > Host->Guest         1676  2290  3149  8026  8439  8283  8216
> > > >
> > > > Original
> > > > Guest->Host          744  1773  5675  1397  8207  7296  8117
> > > > Patch
> > > > Guest->Host         1041  1386  5407  7057  8298  8127  8241
> > >
> > > Looks like there's some noise in the results; the throughput of
> > > "Original Guest->Host 2K" looks too low. And something strange is
> > > that I see regressions in packet transmission from the guest when
> > > testing this patch (guest to local host TCP_STREAM on a NUMA
> > > machine).
> >
> > Yes, since I didn't mask the vcpus on the same socket, it might
> > come from packets out of order. I will rerun the test with the
> > vcpus masked on the same socket to see if there is any difference.
>
> Did anything interesting turn up?

Haven't had time to focus on single-stream results yet.

> > You can reference Tom's results. His test is more formal than mine.
> >
> > > > 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/s,
> > > > 65% improvement with the vcpus tasksetted on the same socket
> > > > 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s,
> > > > 67% improvement with the vcpus tasksetted on the same socket
> > > >
> > > > Tom has run 1-VM to 24-VM tests for different workloads. He
> > > > will post them here soon.
> > > >
> > > > If the host scheduler ensures that the VM's vcpus are not
> > > > scheduled to another socket (i.e. cpu-mask the vcpus on the
> > > > same socket) then the performance will be better.
> > > >
> > > > Signed-off-by: Shirley Ma <xma@xxxxxxxxxx>
> > > > Signed-off-by: Krishna Kumar <krkumar2@xxxxxxxxxx>
> > > > Tested-by: Tom Lendacky <toml@xxxxxxxxxx>
> > > > ---
> > > >
> > > >  drivers/vhost/net.c   |  26 ++-
> > > >  drivers/vhost/vhost.c | 289 +++++++++++++++++++++++----------
> > > >  drivers/vhost/vhost.h |  16 ++-
> > > >  3 files changed, 232 insertions(+), 103 deletions(-)
> > > >
> > > > Thanks
> > > > Shirley
>
> Also a question: how does this interact with zero copy tx?

Yes, I tested this with zero copy tx. The work on the vhost thread
which handles tx has been significantly reduced.

Thanks
Shirley