On 2014/11/26 18:22, Marc Zyngier wrote:
> On 26/11/14 09:16, Christoffer Dall wrote:
>> On Wed, Nov 26, 2014 at 04:31:44PM +0800, Shannon Zhao wrote:
>>> On 2014/11/26 15:24, Christoffer Dall wrote:
>>>> On Wed, Nov 26, 2014 at 02:13:32PM +0800, Shannon Zhao wrote:
>>>>> On 2014/11/25 19:55, Marc Zyngier wrote:
>>>>>> On 25/11/14 11:49, Christoffer Dall wrote:
>>>>>>> On Tue, Nov 25, 2014 at 11:24:43AM +0000, Marc Zyngier wrote:
>>>>>>>> On 25/11/14 11:11, Christoffer Dall wrote:
>>>>>>>>> On Tue, Nov 25, 2014 at 10:54:18AM +0800, Shannon Zhao wrote:
>>>>>>>>>> On 2014/11/24 18:53, Christoffer Dall wrote:
>>>>>>>>>>> On Mon, Nov 24, 2014 at 03:53:16PM +0800, Shannon Zhao wrote:
>>>>>>>>>>>> Hi Marc, Christoffer,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2014/11/23 4:04, Christoffer Dall wrote:
>>>>>>>>>>>>> On Wed, Nov 19, 2014 at 06:11:25PM +0800, Shannon Zhao wrote:
>>>>>>>>>>>>>> When calling kvm_vgic_inject_irq to inject an interrupt, we can
>>>>>>>>>>>>>> know which vcpu the interrupt is for from the irq_num and the
>>>>>>>>>>>>>> cpuid, so we should kick just that vcpu to avoid iterating
>>>>>>>>>>>>>> through all of them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Shannon Zhao <zhaoshenglong@xxxxxxxxxx>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This looks reasonable to me:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>>>>>>>>>>>>>
>>>>>>>>>>>>> But as Marc said, we have to consider the churn by introducing more
>>>>>>>>>>>>> changes to the vgic (that file is being hammered pretty intensely
>>>>>>>>>>>>> these days), so if you feel this is an urgent optimization, it would
>>>>>>>>>>>>> be useful to see some data backing this up.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Today I ran a test which measures the cycles of kvm_vgic_inject_irq
>>>>>>>>>>>> with the PMU. I just measured the cycles of SPIs using virtio-net.
>>>>>>>>>>>> Test steps:
>>>>>>>>>>>> 1) start a VM with 8 VCPUs
>>>>>>>>>>>> 2) In the guest, bind the irq of virtio to CPU8; the host pings the
>>>>>>>>>>>>    VM, and we get the cycles
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test shows:
>>>>>>>>>>>> Without this patch, the cycles are about 3700 (3300-5000), and with
>>>>>>>>>>>> this patch, the cycles are about 3000 (2500-3200).
>>>>>>>>>>>> From this test, I think this patch can bring some improvement.
>>>>>>>>>>>
>>>>>>>>>>> Are these averaged numbers?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes :-)
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The test code is shown below. As there is almost no difference in
>>>>>>>>>>>> vgic_update_irq_pending with and without this patch, we just
>>>>>>>>>>>> measure the kick's cycles.
>>>>>>>>>>>>
>>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>>                         bool level)
>>>>>>>>>>>> {
>>>>>>>>>>>>         unsigned long cycles_1, cycles_2;
>>>>>>>>>>>>
>>>>>>>>>>>>         if (likely(vgic_initialized(kvm)) &&
>>>>>>>>>>>>             vgic_update_irq_pending(kvm, cpuid, irq_num, level)) {
>>>>>>>>>>>>                 start_pmu();
>>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_1));
>>>>>>>>>>>>                 vgic_kick_vcpus(kvm);
>>>>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_2));
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>>>>                         bool level)
>>>>>>>>>>>> {
>>>>>>>>>>>>         int vcpu_id;
>>>>>>>>>>>>         unsigned long cycles_a, cycles_b;
>>>>>>>>>>>>
>>>>>>>>>>>>         if (likely(vgic_initialized(kvm))) {
>>>>>>>>>>>>                 vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
>>>>>>>>>>>>                 if (vcpu_id >= 0) {
>>>>>>>>>>>>                         start_pmu();
>>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_a));
>>>>>>>>>>>>                         /* kick the specified vcpu */
>>>>>>>>>>>>                         kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
>>>>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_b));
>>>>>>>>>>>>                 }
>>>>>>>>>>>>         }
>>>>>>>>>>>>
>>>>>>>>>>>>         return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Can you run some IPI-intensive benchmark in your guest and let us know
>>>>>>>>>>> if you see improvements on that level?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cool, I'll try to find some benchmarks and run them. Are there any
>>>>>>>>>> IPI-intensive benchmarks you would suggest?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hackbench with processes sure seems to like IPIs.
>>>>>>>>
>>>>>>>> But that'd mostly be IPIs in the guest, and we're hoping for that patch
>>>>>>>> to result in a reduction in the number of IPIs on the host when
>>>>>>>> interrupts are injected.
>>>>>>>
>>>>>>> Ah right, I remembered the SGI handling register calling
>>>>>>> kvm_vgic_inject_irq(), but it doesn't.
>>>>>>>
>>>>>>>>
>>>>>>>> I guess that having a workload that generates many interrupts on an SMP
>>>>>>>> guest should result in a reduction of the number of IPIs on the host.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>> That's sort of what Shannon did already, only we need to measure a drop
>>>>>>> in overall cpu utilization on the host instead, or look at iperf numbers
>>>>>>> or something like that. Right?
>>>>>>
>>>>>> Yes. I guess that vmstat running in the background on the host should
>>>>>> give a good indication of what is going on.
>>>>>>
>>>>>
>>>>> I just measured the overall cpu utilization on the host using vmstat and iperf.
>>>>> I started a VM with 8 vcpus and used iperf to send packets from host to guest,
>>>>> binding the interrupt of virtio to cpu0 and cpu7.
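[The measurement code in the thread takes two PMCCNTR_EL0 snapshots around each kick, and the figures quoted (about 3700 cycles without the patch, about 3000 with it) are the averaged deltas. A minimal sketch of that reduction step, written in the thread's own language; avg_kick_cycles and the sample values below are hypothetical illustrations for this post, not code from the patch:]

```c
#include <stddef.h>

/*
 * Each (start, end) pair is one PMCCNTR_EL0 snapshot pair bracketing a
 * vgic_kick_vcpus() or kvm_vcpu_kick() call; the reported figure is the
 * mean of the per-call deltas. Hypothetical helper, not a kernel function.
 */
unsigned long avg_kick_cycles(const unsigned long *start,
                              const unsigned long *end, size_t n)
{
        unsigned long long sum = 0;
        size_t i;

        for (i = 0; i < n; i++)
                sum += end[i] - start[i];       /* cycles spent in the kick */

        return n ? (unsigned long)(sum / n) : 0;
}
```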
>>>>>
>>>>> The results are as follows:
>>>>>
>>>>> Without this patch:
>>>>> Bind to cpu0:
>>>>> Bandwidth: 6.60 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  2  0      0 7795456      0 120568    0    0     0     0  8967 11405  2  2 96  0
>>>>> Bind to cpu7:
>>>>> Bandwidth: 6.13 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  1  0      0 7795016      0 120572    0    0     0     0 14633 20710  2  3 95  0
>>>>>
>>>>> With this patch:
>>>>> Bind to cpu0:
>>>>> Bandwidth: 6.99 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  1  0      0 7788048      0 124836    0    0     0     0 10149 11593  2  2 96  0
>>>>> Bind to cpu7:
>>>>> Bandwidth: 6.53 Gbits/sec
>>>>> vmstat data:
>>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>>  r  b   swpd    free   buff  cache   si   so    bi    bo    in    cs us sy id wa
>>>>>  1  0      0 7791044      0 124832    0    0     0     0 11408 14179  2  2 96  0
>>>>>
>>>>> From the data, it has some improvement :-)
>>>>>
>>>> Indeed things are promising, but I have some questions:
>>>>
>>>> - Are these numbers consistent (average over how many runs, what's the
>>>>   variance)?
>>>> - Why the seemingly consistent difference between binding to cpu0 and
>>>>   cpu7?
>>>
>>> The vmstat command line is "vmstat 1 1000", and I just pasted the common
>>> numbers, not the averages. The difference between with and without this
>>> patch when binding to cpu7 is obvious, but the difference when binding
>>> to cpu0 is not.
>>>
>>>> - what were the iperf parameters?
>>>
>>> I forgot to paste it.
>>> ./iperf -c 192.168.0.2 -P 1 -p 5001 -f G -t 60
>>>
>>>> - the number of context switches going down for cpu7 with/without
>>>>   the patch is interesting, but the fact that we don't see this for
>>>>   cpu0 makes me doubt that it's meaningful.
>>>>
>>>
>>> Ah, sorry that I didn't give average numbers. With/without this patch,
>>> the number of context switches when binding to cpu0 shows almost no
>>> difference; the numbers are at the same level.
>>>
>>>> Based purely on the bandwidth, and assuming those numbers are
>>>> consistent, I would say we should go ahead and merge this.
>>>>
>>>
>>> Thanks. The bandwidth numbers are averaged and consistent.
>>>
>> So I would say "ship it", Marc?
>
> Yes, this is convincing enough. Shannon, thanks for going the extra
> mile, running the tests and sharing the figures.
>
> Applied to -next

Marc, Christoffer,

Thanks for your help and patience.

Shannon

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm