On 26/11/14 09:16, Christoffer Dall wrote:
> On Wed, Nov 26, 2014 at 04:31:44PM +0800, Shannon Zhao wrote:
>> On 2014/11/26 15:24, Christoffer Dall wrote:
>>> On Wed, Nov 26, 2014 at 02:13:32PM +0800, Shannon Zhao wrote:
>>>> On 2014/11/25 19:55, Marc Zyngier wrote:
>>>>> On 25/11/14 11:49, Christoffer Dall wrote:
>>>>>> On Tue, Nov 25, 2014 at 11:24:43AM +0000, Marc Zyngier wrote:
>>>>>>> On 25/11/14 11:11, Christoffer Dall wrote:
>>>>>>>> On Tue, Nov 25, 2014 at 10:54:18AM +0800, Shannon Zhao wrote:
>>>>>>>>> On 2014/11/24 18:53, Christoffer Dall wrote:
>>>>>>>>>> On Mon, Nov 24, 2014 at 03:53:16PM +0800, Shannon Zhao wrote:
>>>>>>>>>>> Hi Marc, Christoffer,
>>>>>>>>>>>
>>>>>>>>>>> On 2014/11/23 4:04, Christoffer Dall wrote:
>>>>>>>>>>>> On Wed, Nov 19, 2014 at 06:11:25PM +0800, Shannon Zhao wrote:
>>>>>>>>>>>>> When calling kvm_vgic_inject_irq to inject an interrupt, we know which
>>>>>>>>>>>>> vcpu the interrupt is for from the irq_num and the cpuid, so we should
>>>>>>>>>>>>> just kick that vcpu instead of iterating through all of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Shannon Zhao <zhaoshenglong@xxxxxxxxxx>
>>>>>>>>>>>>
>>>>>>>>>>>> This looks reasonable to me:
>>>>>>>>>>>>
>>>>>>>>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>>>>>>>>>>>>
>>>>>>>>>>>> But as Marc said, we have to consider the churn of introducing more
>>>>>>>>>>>> changes to the vgic (that file is being hammered pretty intensely
>>>>>>>>>>>> these days), so if you feel this is an urgent optimization, it would
>>>>>>>>>>>> be useful to see some data backing this up.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Today I ran a test which measures the cycles spent in kvm_vgic_inject_irq
>>>>>>>>>>> using the PMU. I only measured the cycles for an SPI, using virtio-net.
>>>>>>>>>>>
>>>>>>>>>>> Test steps:
>>>>>>>>>>> 1) Start a VM with 8 VCPUs.
>>>>>>>>>>> 2) In the guest, bind the virtio irq to CPU8; from the host, ping the VM
>>>>>>>>>>>    and collect the cycle counts.
>>>>>>>>>>>
>>>>>>>>>>> The test shows:
>>>>>>>>>>> Without this patch the cycle count is about 3700 (3300-5000); with this
>>>>>>>>>>> patch it is about 3000 (2500-3200).
>>>>>>>>>>> From this test, I think this patch can bring some improvement.
>>>>>>>>>>
>>>>>>>>>> Are these averaged numbers?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes :-)
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The test code is like below. As there is almost no difference in
>>>>>>>>>>> vgic_update_irq_pending with and without this patch, I just measure
>>>>>>>>>>> the kick's cycles.
>>>>>>>>>>>
>>>>>>>>>>> /* Without the patch: measure the cycles of kicking all vcpus. */
>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>                         bool level)
>>>>>>>>>>> {
>>>>>>>>>>>         unsigned long cycles_1, cycles_2;
>>>>>>>>>>>
>>>>>>>>>>>         if (likely(vgic_initialized(kvm)) &&
>>>>>>>>>>>             vgic_update_irq_pending(kvm, cpuid, irq_num, level)) {
>>>>>>>>>>>                 start_pmu();
>>>>>>>>>>>                 /* read the cycle counter before and after the kick */
>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_1));
>>>>>>>>>>>                 vgic_kick_vcpus(kvm);
>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_2));
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>>         return 0;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> /* With the patch: measure the cycles of kicking only the target vcpu. */
>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>                         bool level)
>>>>>>>>>>> {
>>>>>>>>>>>         int vcpu_id;
>>>>>>>>>>>         unsigned long cycles_a, cycles_b;
>>>>>>>>>>>
>>>>>>>>>>>         if (likely(vgic_initialized(kvm))) {
>>>>>>>>>>>                 vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
>>>>>>>>>>>                 if (vcpu_id >= 0) {
>>>>>>>>>>>                         start_pmu();
>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_a));
>>>>>>>>>>>                         /* kick the specified vcpu */
>>>>>>>>>>>                         kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_b));
>>>>>>>>>>>                 }
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>>         return 0;
>>>>>>>>>>> }
>>>>>>>>>>>
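For context, the vgic_kick_vcpus() call in the first version walks every vcpu
of the VM and kicks each one that has a pending interrupt. A rough sketch of
that path (a paraphrase of the vgic code of that era, not an exact quote;
helper names follow the kernel of that time):

static void vgic_kick_vcpus(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	int c;

	/* Walk all vcpus and kick (wake/IPI) any with a pending vgic irq. */
	kvm_for_each_vcpu(c, vcpu, kvm) {
		if (kvm_vgic_vcpu_pending_irq(vcpu))
			kvm_vcpu_kick(vcpu);
	}
}

With the patch, vgic_update_irq_pending() returns the target vcpu_id and only
kvm_get_vcpu(kvm, vcpu_id) is kicked, which is where the saved cycles and the
reduced number of host IPIs come from.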
>>>>>>>>>>
>>>>>>>>>> Can you run some IPI-intensive benchmark in your guest and let us know
>>>>>>>>>> if you see improvements on that level?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cool, I'll try to find some benchmarks and run them. Are there any
>>>>>>>>> IPI-intensive benchmarks you would suggest?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hackbench with processes sure seems to like IPIs.
>>>>>>>
>>>>>>> But that'd mostly be IPIs in the guest, and we're hoping for this patch
>>>>>>> to result in a reduction in the number of IPIs on the host when
>>>>>>> interrupts are injected.
>>>>>>
>>>>>> Ah right, I thought I remembered the SGI handling register calling
>>>>>> kvm_vgic_inject_irq(), but it doesn't.
>>>>>>
>>>>>>>
>>>>>>> I guess that having a workload that generates many interrupts on an SMP
>>>>>>> guest should result in a reduction of the number of IPIs on the host.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>> That's sort of what Shannon did already, only we need to measure a drop
>>>>>> in overall cpu utilization on the host instead, or look at iperf numbers
>>>>>> or something like that. Right?
>>>>>
>>>>> Yes. I guess that vmstat running in the background on the host should
>>>>> give a good indication of what is going on.
>>>>>
>>>>
>>>> I measured the overall cpu utilization on the host using vmstat and iperf.
>>>> I started a VM with 8 vcpus and used iperf to send packets from host to
>>>> guest, binding the virtio interrupt to cpu0 and then to cpu7.
>>>>
>>>> The results are as follows:
>>>>
>>>> Without this patch:
>>>> Bind to cpu0:
>>>> Bandwidth: 6.60 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  2  0      0 7795456     0 120568    0    0     0     0  8967 11405  2  2 96  0
>>>>
>>>> Bind to cpu7:
>>>> Bandwidth: 6.13 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  1  0      0 7795016     0 120572    0    0     0     0 14633 20710  2  3 95  0
>>>>
>>>> With this patch:
>>>> Bind to cpu0:
>>>> Bandwidth: 6.99 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  1  0      0 7788048     0 124836    0    0     0     0 10149 11593  2  2 96  0
>>>>
>>>> Bind to cpu7:
>>>> Bandwidth: 6.53 Gbits/sec
>>>> vmstat data:
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd    free  buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>  1  0      0 7791044     0 124832    0    0     0     0 11408 14179  2  2 96  0
>>>>
>>>> From the data, there is some improvement :-)
>>>>
>>> Indeed things are promising, but I have some questions:
>>>
>>>  - Are these numbers consistent (averaged over how many runs, and what is
>>>    the variance)?
>>>  - Why the seemingly consistent difference between binding to cpu0 and
>>>    cpu7?
>>
>> The vmstat command line is "vmstat 1 1000", and I just pasted typical
>> numbers, not averages. The difference between with and without this patch
>> when binding to cpu7 is obvious, but the difference when binding to cpu0
>> is not.
>>
>>>  - What were the iperf parameters?
>>
>> I forgot to paste it:
>> ./iperf -c 192.168.0.2 -P 1 -p 5001 -f G -t 60
>>
>>>  - The number of context switches going down for cpu7 with/without the
>>>    patch is interesting, but the fact that we don't see this for cpu0
>>>    makes me doubt that it's meaningful.
>>>
>>
>> Ah, sorry that I didn't get the average numbers. With or without this
>> patch, the number of context switches when binding to cpu0 is almost the
>> same; the numbers are at the same level.
>>
>>> Based purely on the bandwidth, and assuming those numbers are
>>> consistent, I would say we should go ahead and merge this.
>>>
>>
>> Thanks. The bandwidth numbers are averaged and consistent.
>>
> So I would say "ship it", Marc?

Yes, this is convincing enough.

Shannon, thanks for going the extra mile, running the tests and sharing
the figures.

Applied to -next.

Thanks,

	M.

-- 
Jazz is not dead. It just smells funny...
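For reference, the change discussed in the thread, with the PMU instrumentation
from the test code stripped out, amounts roughly to the following (a sketch
reconstructed from the snippets quoted above, not the literal committed patch):

int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
			bool level)
{
	int vcpu_id;

	if (likely(vgic_initialized(kvm))) {
		/*
		 * vgic_update_irq_pending() reports which vcpu the interrupt
		 * targets; kick only that vcpu instead of every vcpu in the VM.
		 */
		vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
		if (vcpu_id >= 0)
			kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
	}

	return 0;
}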