On 26/11/14 09:16, Christoffer Dall wrote:
> On Wed, Nov 26, 2014 at 04:31:44PM +0800, Shannon Zhao wrote:
>> On 2014/11/26 15:24, Christoffer Dall wrote:
>>> On Wed, Nov 26, 2014 at 02:13:32PM +0800, Shannon Zhao wrote:
>>>> On 2014/11/25 19:55, Marc Zyngier wrote:
>>>>> On 25/11/14 11:49, Christoffer Dall wrote:
>>>>>> On Tue, Nov 25, 2014 at 11:24:43AM +0000, Marc Zyngier wrote:
>>>>>>> On 25/11/14 11:11, Christoffer Dall wrote:
>>>>>>>> On Tue, Nov 25, 2014 at 10:54:18AM +0800, Shannon Zhao wrote:
>>>>>>>>> On 2014/11/24 18:53, Christoffer Dall wrote:
>>>>>>>>>> On Mon, Nov 24, 2014 at 03:53:16PM +0800, Shannon Zhao wrote:
>>>>>>>>>>> Hi Marc, Christoffer,
>>>>>>>>>>>
>>>>>>>>>>> On 2014/11/23 4:04, Christoffer Dall wrote:
>>>>>>>>>>>> On Wed, Nov 19, 2014 at 06:11:25PM +0800, Shannon Zhao wrote:
>>>>>>>>>>>>> When calling kvm_vgic_inject_irq to inject an interrupt, we know which
>>>>>>>>>>>>> vcpu the interrupt is for from the irq_num and the cpuid, so we should
>>>>>>>>>>>>> just kick that vcpu instead of iterating through all of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Shannon Zhao <zhaoshenglong@xxxxxxxxxx>
>>>>>>>>>>>>
>>>>>>>>>>>> This looks reasonable to me:
>>>>>>>>>>>>
>>>>>>>>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>>>>>>>>>>>>
>>>>>>>>>>>> But as Marc said, we have to consider the churn of introducing more
>>>>>>>>>>>> changes to the vgic (that file is being hammered pretty intensely
>>>>>>>>>>>> these days), so if you feel this is an urgent optimization, it would
>>>>>>>>>>>> be useful to see some data backing this up.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Today I ran a test which measures the cycles spent in kvm_vgic_inject_irq
>>>>>>>>>>> using the PMU. I only measured the cycles for an SPI, using virtio-net.
>>>>>>>>>>>
>>>>>>>>>>> Test steps:
>>>>>>>>>>> 1) Start a VM with 8 VCPUs.
>>>>>>>>>>> 2) In the guest, bind the virtio irq to CPU8; from the host, ping the VM
>>>>>>>>>>>    and collect the cycle counts.
>>>>>>>>>>>
>>>>>>>>>>> The test shows:
>>>>>>>>>>> Without this patch the cycle count is about 3700 (3300-5000); with this
>>>>>>>>>>> patch it is about 3000 (2500-3200).
>>>>>>>>>>> From this test, I think this patch can bring some improvement.
>>>>>>>>>>
>>>>>>>>>> Are these averaged numbers?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes :-)
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The test code is like below. As there is almost no difference in
>>>>>>>>>>> vgic_update_irq_pending with and without this patch, I just measure
>>>>>>>>>>> the kick's cycles.
>>>>>>>>>>>
>>>>>>>>>>> /* Without the patch: measure the cycles of kicking all vcpus. */
>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>                         bool level)
>>>>>>>>>>> {
>>>>>>>>>>>         unsigned long cycles_1, cycles_2;
>>>>>>>>>>>
>>>>>>>>>>>         if (likely(vgic_initialized(kvm)) &&
>>>>>>>>>>>             vgic_update_irq_pending(kvm, cpuid, irq_num, level)) {
>>>>>>>>>>>                 start_pmu();
>>>>>>>>>>>                 /* read the cycle counter before and after the kick */
>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_1));
>>>>>>>>>>>                 vgic_kick_vcpus(kvm);
>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_2));
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>>         return 0;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> /* With the patch: measure the cycles of kicking only the target vcpu. */
>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>                         bool level)
>>>>>>>>>>> {
>>>>>>>>>>>         int vcpu_id;
>>>>>>>>>>>         unsigned long cycles_a, cycles_b;
>>>>>>>>>>>
>>>>>>>>>>>         if (likely(vgic_initialized(kvm))) {
>>>>>>>>>>>                 vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
>>>>>>>>>>>                 if (vcpu_id >= 0) {
>>>>>>>>>>>                         start_pmu();
>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_a));
>>>>>>>>>>>                         /* kick the specified vcpu */
>>>>>>>>>>>                         kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_b));
>>>>>>>>>>>                 }
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>>         return 0;
>>>>>>>>>>> }
>>>>>>>>>>>
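For context, the vgic_kick_vcpus() call in the first version walks every vcpu
of the VM and kicks each one that has a pending interrupt. A rough sketch of
that path (a paraphrase of the vgic code of that era, not an exact quote;
helper names follow the kernel of that time):

static void vgic_kick_vcpus(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	int c;

	/* Walk all vcpus and kick (wake/IPI) any with a pending vgic irq. */
	kvm_for_each_vcpu(c, vcpu, kvm) {
		if (kvm_vgic_vcpu_pending_irq(vcpu))
			kvm_vcpu_kick(vcpu);
	}
}

With the patch, vgic_update_irq_pending() returns the target vcpu_id and only
kvm_get_vcpu(kvm, vcpu_id) is kicked, which is where the saved cycles and the
reduced number of host IPIs come from.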
>>>>>>>>>>
>>>>>>>>>> Can you run some IPI-intensive benchmark in your guest and let us know
>>>>>>>>>> if you see improvements on that level?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cool, I'll try to find some benchmarks and run them. Are there any
>>>>>>>>> IPI-intensive benchmarks you would suggest?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hackbench with processes sure seems to like IPIs.
>>>>>>>
>>>>>>> But that'd mostly be IPIs in the guest, and we're hoping for this patch
>>>>>>> to result in a reduction in the number of IPIs on the host when
>>>>>>> interrupts are injected.
>>>>>>
>>>>>> Ah right, I thought I remembered the SGI handling register calling
>>>>>> kvm_vgic_inject_irq(), but it doesn't.
>>>>>>
>>>>>>>
>>>>>>> I guess that having a workload that generates many interrupts on an SMP
>>>>>>> guest should result in a reduction of the number of IPIs on the host.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>> That's sort of what Shannon did already, only we need to measure a drop
>>>>>> in overall cpu utilization on the host instead, or look at iperf numbers
>>>>>> or something like that. Right?
>>>>>
>>>>> Yes. I guess that vmstat running in the background on the host should
>>>>> give a good indication of what is going on.
>>>>>
>>>>
>>>> I measured the overall cpu utilization on the host using vmstat and iperf.
>>>> I started a VM with 8 vcpus and used iperf to send packets from host to
>>>> guest, binding the virtio interrupt to cpu0 and then to cpu7.
>>>>
>>>> The results are as follows:
>>>>
>>>> Without this patch:
>>>> Bind to cpu0:
>>>> Bandwidth: 6.60 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  2  0      0 7795456     0 120568    0    0     0     0  8967 11405  2  2 96  0
>>>>
>>>> Bind to cpu7:
>>>> Bandwidth: 6.13 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  1  0      0 7795016     0 120572    0    0     0     0 14633 20710  2  3 95  0
>>>>
>>>> With this patch:
>>>> Bind to cpu0:
>>>> Bandwidth: 6.99 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  1  0      0 7788048     0 124836    0    0     0     0 10149 11593  2  2 96  0
>>>>
>>>> Bind to cpu7:
>>>> Bandwidth: 6.53 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  1  0      0 7791044     0 124832    0    0     0     0 11408 14179  2  2 96  0
>>>>
>>>> From the data, there is some improvement :-)
>>>>
>>> Indeed things are promising, but I have some questions:
>>>
>>>  - Are these numbers consistent (averaged over how many runs, and what is
>>>    the variance)?
>>>  - Why the seemingly consistent difference between binding to cpu0 and
>>>    cpu7?
>>
>> The vmstat command line is "vmstat 1 1000", and I just pasted typical
>> numbers, not averages. The difference between with and without this patch
>> when binding to cpu7 is obvious, but the difference when binding to cpu0
>> is not.
>>
>>>  - What were the iperf parameters?
>>
>> I forgot to paste it:
>> ./iperf -c 192.168.0.2 -P 1 -p 5001 -f G -t 60
>>
>>>  - The number of context switches going down for cpu7 with/without the
>>>    patch is interesting, but the fact that we don't see this for cpu0
>>>    makes me doubt that it's meaningful.
>>>
>>
>> Ah, sorry that I didn't get the average numbers. With or without this
>> patch, the number of context switches when binding to cpu0 is almost the
>> same; the numbers are at the same level.
>>
>>> Based purely on the bandwidth, and assuming those numbers are
>>> consistent, I would say we should go ahead and merge this.
>>>
>>
>> Thanks. The bandwidth numbers are averaged and consistent.
>>
> So I would say "ship it", Marc?

Yes, this is convincing enough.

Shannon, thanks for going the extra mile, running the tests and sharing
the figures.

Applied to -next.

Thanks,

	M.

-- 
Jazz is not dead. It just smells funny...
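For reference, the change discussed in the thread, with the PMU instrumentation
from the test code stripped out, amounts roughly to the following (a sketch
reconstructed from the snippets quoted above, not the literal committed patch):

int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
			bool level)
{
	int vcpu_id;

	if (likely(vgic_initialized(kvm))) {
		/*
		 * vgic_update_irq_pending() reports which vcpu the interrupt
		 * targets; kick only that vcpu instead of every vcpu in the VM.
		 */
		vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
		if (vcpu_id >= 0)
			kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
	}

	return 0;
}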