Re: [PATCH v7 04/11] KVM: MMU: zap pages in batch

Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx> · Wed, 29 May 2013 22:00:08 +0800

On 05/29/2013 09:21 PM, Marcelo Tosatti wrote:
> On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
>> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
>>> On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
>>>> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
>>>>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
>>>>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
>>>>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
>>>>>>>> Zap at lease 10 pages before releasing mmu-lock to reduce the overload
>>>>>>>> caused by requiring lock
>>>>>>>>
>>>>>>>> After the patch, kvm_zap_obsolete_pages can forward progress anyway,
>>>>>>>> so update the comments
>>>>>>>>
>>>>>>>> [ It improves kernel building 0.6% ~ 1% ]
>>>>>>>
>>>>>>> Can you please describe the overload in more detail? Under what scenario
>>>>>>> is kernel building improved?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> The scenario is we do kernel building, meanwhile, repeatedly read PCI rom
>>>>>> every one second.
>>>>>>
>>>>>> [
>>>>>>    echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
>>>>>>    cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
>>>>>> ]
>>>>>
>>>>> Can't see why it reflects real world scenario (or a real world
>>>>> scenario with same characteristics regarding kvm_mmu_zap_all vs faults)?
>>>>>
>>>>> Point is, it would be good to understand why this change 
>>>>> is improving performance? What are these cases where breaking out of
>>>>> kvm_mmu_zap_all due to either (need_resched || spin_needbreak) on zapped
>>>>> < 10 ?
>>>>
>>>> When guest read ROM, qemu will set the memory to map the device's firmware,
>>>> that is why kvm_mmu_zap_all can be called in the scenario.
>>>>
>>>> The reasons why it heart the performance are:
>>>> 1): Qemu use a global io-lock to sync all vcpu, so that the io-lock is held
>>>>     when we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient, all
>>>>     other vcpus need wait a long time to do I/O.
>>>>
>>>> 2): kvm_mmu_zap_all() is triggered in vcpu context. so it can block the IPI
>>>>     request from other vcpus.
>>>>
>>>> Is it enough?
>>>
>>> That is no problem. The problem is why you chose "10" as the minimum number of
>>> pages to zap before considering reschedule. I would expect the need to
>>
>> Well, my description above explained why batch-zapping is needed - we do
>> not want the vcpu spend lots of time to zap all pages because it hurts other
>> vcpus running.
>>
>> But, why the batch page number is "10"... I can not answer this, i just guessed
>> that '10' can make vcpu do not spend long time on zap_all_pages and do
>> not cause mmu-lock too hungry. "10" is the speculative value and i am not sure
>> it is the best value but at lease, i think it can work.
>>
>>> reschedule to be rare enough that one kvm_mmu_zap_all instance (between
>>> schedule in and schedule out) to be able to release no less than a
>>> thousand pages.
>>
>> Unfortunately, no.
>>
>> This information is I replied Gleb in his mail where he raced a question that
>> why "collapse tlb flush is needed":
>>
>> ======
>> It seems no.
>> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
>> is easily contended. I did the simple track:
>>
>> +       int num = 0;
>>  restart:
>>         list_for_each_entry_safe_reverse(sp, node,
>>               &kvm->arch.active_mmu_pages, link) {
>> @@ -4265,6 +4265,7 @@ restart:
>>                 if (batch >= BATCH_ZAP_PAGES &&
>>                       cond_resched_lock(&kvm->mmu_lock)) {
>>                         batch = 0;
>> +                       num++;
>>                         goto restart;
>>                 }
>>
>> @@ -4277,6 +4278,7 @@ restart:
>>          * may use the pages.
>>          */
>>         kvm_mmu_commit_zap_page(kvm, &invalid_list);
>> +       printk("lock-break: %d.\n", num);
>>  }
>>
>> I do read pci rom when doing kernel building in the guest which
>> has 1G memory and 4vcpus with ept enabled, this is the normal
>> workload and normal configuration.
>>
>> # dmesg
>> [ 2338.759099] lock-break: 8.
>> [ 2339.732442] lock-break: 5.
>> [ 2340.904446] lock-break: 3.
>> [ 2342.513514] lock-break: 3.
>> [ 2343.452229] lock-break: 3.
>> [ 2344.981599] lock-break: 4.
>>
>> Basically, we need to break many times.
>> ======
>>
>> You can see we should break 3 times to zap all pages even if we have zapoed
>> 10 pages in batch. It is obviously that it need break more times without
>> batch-zapping.
> 
> Yes, but this is not a real scenario, or even describes a real scenario
> as far as i know. 

Aha.

Okay, maybe "read rom" is not the common case, but vcpu can trigger it or guest
driver can do in the further. What happen if vcpu trigger it? The worst
case is, one vcpu is always break mmu-lock due to one other vcpus doing intense
memory access and the reset vcpus is waiting IO or its IPI. It is easy soft-lockup.

More worse, if host memory is really less and host is always trying to reclaim qemu's
memory, that cause the lock always be hot and can not zap one page before rescheduled.

> 
> Are you sure this minimum-batching-before-considering-reschedule even
> after obsolete pages optimization?

Yes, this track is after all patches in this series are applied.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html