On 2014/11/26 18:22, Marc Zyngier wrote:
> On 26/11/14 09:16, Christoffer Dall wrote:
>> On Wed, Nov 26, 2014 at 04:31:44PM +0800, Shannon Zhao wrote:
>>> On 2014/11/26 15:24, Christoffer Dall wrote:
>>>> On Wed, Nov 26, 2014 at 02:13:32PM +0800, Shannon Zhao wrote:
>>>>> On 2014/11/25 19:55, Marc Zyngier wrote:
>>>>>> On 25/11/14 11:49, Christoffer Dall wrote:
>>>>>>> On Tue, Nov 25, 2014 at 11:24:43AM +0000, Marc Zyngier wrote:
>>>>>>>> On 25/11/14 11:11, Christoffer Dall wrote:
>>>>>>>>> On Tue, Nov 25, 2014 at 10:54:18AM +0800, Shannon Zhao wrote:
>>>>>>>>>> On 2014/11/24 18:53, Christoffer Dall wrote:
>>>>>>>>>>> On Mon, Nov 24, 2014 at 03:53:16PM +0800, Shannon Zhao wrote:
>>>>>>>>>>>> Hi Marc, Christoffer,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2014/11/23 4:04, Christoffer Dall wrote:
>>>>>>>>>>>>> On Wed, Nov 19, 2014 at 06:11:25PM +0800, Shannon Zhao wrote:
>>>>>>>>>>>>>> When calling kvm_vgic_inject_irq to inject an interrupt, we can
>>>>>>>>>>>>>> know which vcpu the interrupt is for from the irq_num and the
>>>>>>>>>>>>>> cpuid, so we should kick just that vcpu to avoid iterating
>>>>>>>>>>>>>> through all of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Shannon Zhao <zhaoshenglong@xxxxxxxxxx>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This looks reasonable to me:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>>>>>>>>>>>>>
>>>>>>>>>>>>> But as Marc said, we have to consider the churn by introducing more
>>>>>>>>>>>>> changes to the vgic (that file is being hammered pretty intensely
>>>>>>>>>>>>> these days), so if you feel this is an urgent optimization, it would
>>>>>>>>>>>>> be useful to see some data backing this up.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Today I ran a test which measures the cycles of kvm_vgic_inject_irq
>>>>>>>>>>>> with the PMU. I just measured the cycles of SPIs using virtio-net.
>>>>>>>>>>>> Test steps:
>>>>>>>>>>>> 1) start a VM with 8 VCPUs
>>>>>>>>>>>> 2) In the guest, bind the irq of virtio to CPU8; the host pings the
>>>>>>>>>>>>    VM, and we get the cycles
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test shows:
>>>>>>>>>>>> Without this patch, the cycles are about 3700 (3300-5000), and with
>>>>>>>>>>>> this patch, the cycles are about 3000 (2500-3200).
>>>>>>>>>>>> From this test, I think this patch can bring some improvement.
>>>>>>>>>>>
>>>>>>>>>>> Are these averaged numbers?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes :-)
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test code is shown below. As there is almost no difference in
>>>>>>>>>>>> vgic_update_irq_pending with and without this patch, we just
>>>>>>>>>>>> measure the kick's cycles.
>>>>>>>>>>>>
>>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>>                         bool level)
>>>>>>>>>>>> {
>>>>>>>>>>>>         unsigned long cycles_1, cycles_2;
>>>>>>>>>>>>
>>>>>>>>>>>>         if (likely(vgic_initialized(kvm)) &&
>>>>>>>>>>>>             vgic_update_irq_pending(kvm, cpuid, irq_num, level)) {
>>>>>>>>>>>>                 start_pmu();
>>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_1));
>>>>>>>>>>>>                 vgic_kick_vcpus(kvm);
>>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_2));
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>>                         bool level)
>>>>>>>>>>>> {
>>>>>>>>>>>>         int vcpu_id;
>>>>>>>>>>>>         unsigned long cycles_a, cycles_b;
>>>>>>>>>>>>
>>>>>>>>>>>>         if (likely(vgic_initialized(kvm))) {
>>>>>>>>>>>>                 vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
>>>>>>>>>>>>                 if (vcpu_id >= 0) {
>>>>>>>>>>>>                         start_pmu();
>>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_a));
>>>>>>>>>>>>                         /* kick the specified vcpu */
>>>>>>>>>>>>                         kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
>>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_b));
>>>>>>>>>>>>                 }
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Can you run some IPI-intensive benchmark in your guest and let us know
>>>>>>>>>>> if you see improvements on that level?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cool, I'll try to find some benchmarks and run them. Are there any
>>>>>>>>>> IPI-intensive benchmarks you would suggest?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hackbench with processes sure seems to like IPIs.
>>>>>>>>
>>>>>>>> But that'd mostly be IPIs in the guest, and we're hoping for that patch
>>>>>>>> to result in a reduction in the number of IPIs on the host when
>>>>>>>> interrupts are injected.
>>>>>>>
>>>>>>> Ah right, I remembered the SGI handling register calling
>>>>>>> kvm_vgic_inject_irq(), but it doesn't.
>>>>>>>
>>>>>>>>
>>>>>>>> I guess that having a workload that generates many interrupts on an SMP
>>>>>>>> guest should result in a reduction of the number of IPIs on the host.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>> That's sort of what Shannon did already, only we need to measure a drop
>>>>>>> in overall cpu utilization on the host instead, or look at iperf numbers
>>>>>>> or something like that. Right?
>>>>>>
>>>>>> Yes. I guess that vmstat running in the background on the host should
>>>>>> give a good indication of what is going on.
>>>>>>
>>>>>
>>>>> I just measured the overall cpu utilization on the host using vmstat and iperf.
>>>>> I started a VM with 8 vcpus and used iperf to send packets from host to guest,
>>>>> binding the interrupt of virtio to cpu0 and cpu7.
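[The measurement code in the thread takes two PMCCNTR_EL0 snapshots around each kick, and the figures quoted (about 3700 cycles without the patch, about 3000 with it) are the averaged deltas. A minimal sketch of that reduction step, written in the thread's own language; avg_kick_cycles and the sample values below are hypothetical illustrations for this post, not code from the patch:]

```c
#include <stddef.h>

/*
 * Each (start, end) pair is one PMCCNTR_EL0 snapshot pair bracketing a
 * vgic_kick_vcpus() or kvm_vcpu_kick() call; the reported figure is the
 * mean of the per-call deltas. Hypothetical helper, not a kernel function.
 */
unsigned long avg_kick_cycles(const unsigned long *start,
                              const unsigned long *end, size_t n)
{
        unsigned long long sum = 0;
        size_t i;

        for (i = 0; i < n; i++)
                sum += end[i] - start[i];       /* cycles spent in the kick */

        return n ? (unsigned long)(sum / n) : 0;
}
```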
>>>>>
>>>>> The results are as follows:
>>>>>
>>>>> Without this patch:
>>>>> Bind to cpu0:
>>>>> Bandwidth: 6.60 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  2  0      0 7795456      0 120568    0    0     0     0  8967 11405  2  2 96  0
>>>>> Bind to cpu7:
>>>>> Bandwidth: 6.13 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  1  0      0 7795016      0 120572    0    0     0     0 14633 20710  2  3 95  0
>>>>>
>>>>> With this patch:
>>>>> Bind to cpu0:
>>>>> Bandwidth: 6.99 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  1  0      0 7788048      0 124836    0    0     0     0 10149 11593  2  2 96  0
>>>>> Bind to cpu7:
>>>>> Bandwidth: 6.53 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  1  0      0 7791044      0 124832    0    0     0     0 11408 14179  2  2 96  0
>>>>>
>>>>> From the data, it has some improvement :-)
>>>>>
>>>> Indeed things are promising, but I have some questions:
>>>>
>>>> - Are these numbers consistent (average over how many runs, what's the
>>>>   variance)?
>>>> - Why the seemingly consistent difference between binding to cpu0 and
>>>>   cpu7?
>>>
>>> The vmstat command line is "vmstat 1 1000", and I just pasted the common
>>> numbers, not the averages. The difference between with and without this
>>> patch when binding to cpu7 is obvious, but the difference when binding
>>> to cpu0 is not.
>>>
>>>> - what were the iperf parameters?
>>>
>>> I forgot to paste it.
>>> ./iperf -c 192.168.0.2 -P 1 -p 5001 -f G -t 60
>>>
>>>> - the number of context switches going down for cpu7 with/without
>>>>   the patch is interesting, but the fact that we don't see this for
>>>>   cpu0 makes me doubt that it's meaningful.
>>>>
>>>
>>> Ah, sorry that I didn't give average numbers. With/without this patch,
>>> the number of context switches when binding to cpu0 shows almost no
>>> difference; the numbers are at the same level.
>>>
>>>> Based purely on the bandwidth, and assuming those numbers are
>>>> consistent, I would say we should go ahead and merge this.
>>>>
>>>
>>> Thanks. The bandwidth numbers are averaged and consistent.
>>>
>> So I would say "ship it", Marc?
>
> Yes, this is convincing enough. Shannon, thanks for going the extra
> mile, running the tests and sharing the figures.
>
> Applied to -next

Marc, Christoffer,

Thanks for your help and patience.

Shannon

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm