Re: [PATCH v3] arm/arm64: KVM: vgic: kick the specific vcpu instead of iterating through all

On 2014/11/26 15:24, Christoffer Dall wrote:
> On Wed, Nov 26, 2014 at 02:13:32PM +0800, Shannon Zhao wrote:
>> On 2014/11/25 19:55, Marc Zyngier wrote:
>>> On 25/11/14 11:49, Christoffer Dall wrote:
>>>> On Tue, Nov 25, 2014 at 11:24:43AM +0000, Marc Zyngier wrote:
>>>>> On 25/11/14 11:11, Christoffer Dall wrote:
>>>>>> On Tue, Nov 25, 2014 at 10:54:18AM +0800, Shannon Zhao wrote:
>>>>>>> On 2014/11/24 18:53, Christoffer Dall wrote:
>>>>>>>> On Mon, Nov 24, 2014 at 03:53:16PM +0800, Shannon Zhao wrote:
>>>>>>>>> Hi Marc, Christoffer,
>>>>>>>>>
>>>>>>>>> On 2014/11/23 4:04, Christoffer Dall wrote:
>>>>>>>>>> On Wed, Nov 19, 2014 at 06:11:25PM +0800, Shannon Zhao wrote:
>>>>>>>>>>> When kvm_vgic_inject_irq is called to inject an interrupt, we can tell
>>>>>>>>>>> which vcpu the interrupt is for from the irq_num and the cpuid, so we
>>>>>>>>>>> should just kick that vcpu instead of iterating through all of them.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Shannon Zhao <zhaoshenglong@xxxxxxxxxx>
>>>>>>>>>>
>>>>>>>>>> This looks reasonable to me:
>>>>>>>>>>
>>>>>>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>>>>>>>>>>
>>>>>>>>>> But as Marc said, we have to consider the churn by introducing more
>>>>>>>>>> changes to the vgic (that file is being hammered pretty intensely
>>>>>>>>>> these days), so if you feel this is an urgent optimization, it would
>>>>>>>>>> be useful to see some data backing this up.
>>>>>>>>>>
>>>>>>>>>
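(For context: vgic_kick_vcpus(), which this patch avoids calling, walks every vcpu
in the VM and kicks each one that has a pending interrupt. Roughly, from
virt/kvm/arm/vgic.c at the time:)

static void vgic_kick_vcpus(struct kvm *kvm)
{
        struct kvm_vcpu *vcpu;
        int c;

        /*
         * An interrupt has been injected; find out who deserves a kick.
         * On a many-vcpu guest this whole loop runs on every injection,
         * which is the overhead the patch removes.
         */
        kvm_for_each_vcpu(c, vcpu, kvm) {
                if (kvm_vgic_vcpu_pending_irq(vcpu))
                        kvm_vcpu_kick(vcpu);
        }
}
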
>>>>>>>>> Today I ran a test that uses the PMU to measure the cycles spent in
>>>>>>>>> kvm_vgic_inject_irq. I only measured the cycles for SPIs, using virtio-net.
>>>>>>>>> Test steps:
>>>>>>>>> 1) start a VM with 8 VCPUs
>>>>>>>>> 2) In the guest, bind the virtio irq to CPU8; ping the VM from the host and collect the cycles
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The test shows:
>>>>>>>>> Without this patch the cycle count is about 3700 (range 3300-5000); with it, about 3000 (range 2500-3200).
>>>>>>>>> From this test, I think the patch brings some improvement.
>>>>>>>>
>>>>>>>> Are these averaged numbers?
>>>>>>>>
>>>>>>>
>>>>>>> Yes:-)
>>>>>>>
>>>>>>>>>
>>>>>>>>> The test code is shown below. Since there is almost no difference in
>>>>>>>>> vgic_update_irq_pending with and without this patch, I only measure the cycles of the kick.
>>>>>>>>>
>>>>>>>>> /* Unpatched variant: time vgic_kick_vcpus(), which scans all vcpus. */
>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>                         bool level)
>>>>>>>>> {
>>>>>>>>>         unsigned long cycles_1, cycles_2;
>>>>>>>>>
>>>>>>>>>         if (likely(vgic_initialized(kvm)) &&
>>>>>>>>>             vgic_update_irq_pending(kvm, cpuid, irq_num, level)) {
>>>>>>>>>                 start_pmu();
>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_1));
>>>>>>>>>                 vgic_kick_vcpus(kvm);
>>>>>>>>>                 __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_2));
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> /* Patched variant: time only the single targeted kvm_vcpu_kick(). */
>>>>>>>>> int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
>>>>>>>>>                         bool level)
>>>>>>>>> {
>>>>>>>>>         int vcpu_id;
>>>>>>>>>         unsigned long cycles_a, cycles_b;
>>>>>>>>>
>>>>>>>>>         if (likely(vgic_initialized(kvm))) {
>>>>>>>>>                 vcpu_id = vgic_update_irq_pending(kvm, cpuid, irq_num, level);
>>>>>>>>>                 if (vcpu_id >= 0) {
>>>>>>>>>                         start_pmu();
>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_a));
>>>>>>>>>                         /* kick only the vcpu the interrupt targets */
>>>>>>>>>                         kvm_vcpu_kick(kvm_get_vcpu(kvm, vcpu_id));
>>>>>>>>>                         __asm__ __volatile__("MRS %0, PMCCNTR_EL0" : "=r"(cycles_b));
>>>>>>>>>                 }
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>>         return 0;
>>>>>>>>> }
>>>>>>>>>
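(start_pmu() is omitted above; all it has to do is enable the PMU cycle counter.
Below is a minimal arm64 sketch, assuming the counters are accessible at the
current exception level; it is an illustration, not necessarily the exact helper
used in the test.)

static inline void start_pmu(void)
{
        unsigned long val;

        /* Enable the PMU and reset the cycle counter: PMCR_EL0.E | PMCR_EL0.C */
        asm volatile("MRS %0, PMCR_EL0" : "=r" (val));
        val |= 0x5;
        asm volatile("MSR PMCR_EL0, %0" : : "r" (val));

        /* Enable PMCCNTR_EL0 itself: set bit 31 of PMCNTENSET_EL0 */
        asm volatile("MSR PMCNTENSET_EL0, %0" : : "r" (1UL << 31));
        isb();
}
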
>>>>>>>>
>>>>>>>> Can you run some IPI-intensive benchmark in your guest and let us know
>>>>>>>> if you see improvements on that level?
>>>>>>>>
>>>>>>>
>>>>>>> Cool, I'll try to find some benchmarks and run them. Are there any IPI-intensive benchmarks you'd suggest?
>>>>>>>
>>>>>>
>>>>>> Hackbench with processes sure seems to like IPIs.
>>>>>
>>>>> But that'd mostly be IPIs in the guest, and we're hoping for that patch
>>>>> to result in a reduction in the number of IPIs on the host when
>>>>> interrupts are injected.
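(The host-side IPI comes from kvm_vcpu_kick() itself: if the target vcpu is
currently running on another physical CPU, the kick ends in smp_send_reschedule(),
i.e. a rescheduling IPI to that CPU. Abridged sketch of the generic path in
virt/kvm/kvm_main.c, for illustration:)

void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
{
        int me, cpu = vcpu->cpu;

        /* ... first wake the vcpu if it is blocked on its wait queue ... */

        me = get_cpu();
        if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
                if (kvm_arch_vcpu_should_kick(vcpu))
                        smp_send_reschedule(cpu); /* the host IPI */
        put_cpu();
}
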
>>>>
>>>> Ah right, I thought the SGI handling register path called
>>>> kvm_vgic_inject_irq(), but it doesn't.
>>>>
>>>>>
>>>>> I guess that a workload generating many interrupts in an SMP
>>>>> guest should result in a reduction in the number of IPIs on the host.
>>>>>
>>>>> What do you think?
>>>>>
>>>> That's sort of what Shannon did already, only we'd need to measure a drop
>>>> in overall cpu utilization on the host instead, or look at iperf numbers,
>>>> or something like that. Right?
>>>
>>> Yes. I guess that vmstat running in the background on the host should
>>> give a good indication of what is going on.
>>>
>>
>> I measured the overall cpu utilization on the host using vmstat while running iperf.
>> I started a VM with 8 vcpus and used iperf to send packets from the host to the guest,
>> binding the virtio interrupt to cpu0 and then to cpu7.
>>
>> The results are as follows:
>>
>> Without this patch:
>> Bind to cpu0:
>> 	Bandwidth : 6.60 Gbits/sec
>> 	vmstat data:
>> 	procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> 	 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 	 2  0      0 7795456      0 120568    0    0     0     0 8967 11405  2  2 96  0
>> Bind to cpu7:
>> 	Bandwidth : 6.13 Gbits/sec
>> 	vmstat data:
>> 	procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> 	 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 	 1  0      0 7795016      0 120572    0    0     0     0 14633 20710  2  3 95  0
>>
>> With this patch:
>> Bind to cpu0:
>> 	Bandwidth : 6.99 Gbits/sec
>> 	vmstat data:
>> 	procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> 	 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 	 1  0      0 7788048      0 124836    0    0     0     0 10149 11593  2  2 96  0
>> Bind to cpu7:
>> 	Bandwidth : 6.53 Gbits/sec
>> 	vmstat data:
>> 	procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>> 	 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>> 	 1  0      0 7791044      0 124832    0    0     0     0 11408 14179  2  2 96  0
>>
>> From the data, there is some improvement :-)
>>
> Indeed things are promising, but I have some questions:
> 
>  - Are these numbers consistent (average over how many runs, what's the
>    variance?)
>  - Why the seemingly consistent difference between binding to cpu0 and
>    cpu7?

The vmstat command line was "vmstat 1 1000", and I pasted representative numbers, not averages.
The difference between with and without this patch is obvious when binding to cpu7, but not
when binding to cpu0.

>  - what were the iperf parameters?

I forgot to paste it:
./iperf -c 192.168.0.2 -P 1 -p 5001 -f G -t 60

>  - the drop in context switches for cpu7 with the patch is interesting,
>    but the fact that we don't see the same for cpu0 makes me doubt that
>    it's meaningful.
> 

Ah, sorry that I didn't take averages. With or without this patch, the number of context
switches when binding to cpu0 is almost unchanged; the numbers are at the same level.

> Based purely on the bandwidth, and assuming those numbers are
> consistent, I would say we should go ahead and merge this.
> 

Thanks. The bandwidth numbers are averaged and consistent.

Shannon

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm



