On Wed, May 29, 2013 at 09:09:09PM +0800, Xiao Guangrong wrote:
> On 05/29/2013 07:11 PM, Marcelo Tosatti wrote:
> > On Tue, May 28, 2013 at 11:02:09PM +0800, Xiao Guangrong wrote:
> >> On 05/28/2013 08:18 AM, Marcelo Tosatti wrote:
> >>> On Mon, May 27, 2013 at 10:20:12AM +0800, Xiao Guangrong wrote:
> >>>> On 05/25/2013 04:34 AM, Marcelo Tosatti wrote:
> >>>>> On Thu, May 23, 2013 at 03:55:53AM +0800, Xiao Guangrong wrote:
> >>>>>> Zap at least 10 pages before releasing mmu-lock to reduce the overhead
> >>>>>> caused by repeatedly acquiring the lock.
> >>>>>>
> >>>>>> After the patch, kvm_zap_obsolete_pages can make forward progress anyway,
> >>>>>> so update the comments.
> >>>>>>
> >>>>>> [ It improves kernel building by 0.6% ~ 1% ]
> >>>>>
> >>>>> Can you please describe the overhead in more detail? Under what scenario
> >>>>> is kernel building improved?
> >>>>
> >>>> Yes.
> >>>>
> >>>> The scenario is: we do a kernel build and meanwhile repeatedly read the
> >>>> PCI ROM once per second.
> >>>>
> >>>> [
> >>>>   echo 1 > /sys/bus/pci/devices/0000\:00\:03.0/rom
> >>>>   cat /sys/bus/pci/devices/0000\:00\:03.0/rom > /dev/null
> >>>> ]
> >>>
> >>> I can't see why this reflects a real world scenario (or a real world
> >>> scenario with the same characteristics regarding kvm_mmu_zap_all vs faults)?
> >>>
> >>> The point is, it would be good to understand why this change improves
> >>> performance. What are these cases where we break out of kvm_mmu_zap_all
> >>> due to either (need_resched || spin_needbreak) with zapped < 10?
> >>
> >> When the guest reads the ROM, qemu sets up the memory region that maps the
> >> device's firmware, which is why kvm_mmu_zap_all can be called in this
> >> scenario.
> >>
> >> The reasons why it hurts performance are:
> >> 1): Qemu uses a global io-lock to sync all vcpus, so the io-lock is held
> >>     while we do kvm_mmu_zap_all(). If kvm_mmu_zap_all() is not efficient,
> >>     all the other vcpus have to wait a long time to do I/O.
> >>
> >> 2): kvm_mmu_zap_all() is triggered in vcpu context, so it can block IPI
> >>     requests from other vcpus.
> >>
> >> Is that enough?
> >
> > That is no problem. The problem is why you chose "10" as the minimum number of
> > pages to zap before considering reschedule. I would expect the need to
>
> Well, my description above explained why batch-zapping is needed - we do not
> want the vcpu to spend a long time zapping all the pages, because that hurts
> the other running vcpus.
>
> But why the batch page number is "10"... I can not answer this; I just guessed
> that '10' keeps the vcpu from spending too long in zap_all_pages while not
> starving the mmu-lock. "10" is a speculative value and I am not sure it is the
> best value, but at least I think it works.
>
> > reschedule to be rare enough that one kvm_mmu_zap_all instance (between
> > schedule in and schedule out) is able to release no less than a
> > thousand pages.
>
> Unfortunately, no.
>
> This is from my reply to Gleb, in the mail where he raised a question about
> why the "collapse tlb flush" is needed:
>
> ======
> It seems no.
> Since we have reloaded mmu before zapping the obsolete pages, the mmu-lock
> is easily contended. I did this simple tracking:
>
> +	int num = 0;
>  restart:
>  	list_for_each_entry_safe_reverse(sp, node,
>  	      &kvm->arch.active_mmu_pages, link) {
> @@ -4265,6 +4265,7 @@ restart:
>  		if (batch >= BATCH_ZAP_PAGES &&
>  		      cond_resched_lock(&kvm->mmu_lock)) {
>  			batch = 0;
> +			num++;
>  			goto restart;
>  		}
>
> @@ -4277,6 +4278,7 @@ restart:
>  	 * may use the pages.
>  	 */
>  	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +	printk("lock-break: %d.\n", num);
>  }
>
> I read the pci rom while doing a kernel build in the guest, which has 1G
> memory and 4 vcpus with ept enabled; this is a normal workload and a normal
> configuration.
>
> # dmesg
> [ 2338.759099] lock-break: 8.
> [ 2339.732442] lock-break: 5.
> [ 2340.904446] lock-break: 3.
> [ 2342.513514] lock-break: 3.
> [ 2343.452229] lock-break: 3.
> [ 2344.981599] lock-break: 4.
>
> Basically, we need to break many times.
> ======
>
> You can see we still have to break at least 3 times to zap all the pages even
> though we zap 10 pages in a batch. Obviously it would need to break even more
> times without batch-zapping.
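[For reference, the loop that the instrumentation above is tracking (the
batched version of kvm_zap_obsolete_pages() from this series) has roughly the
shape below. This is a simplified sketch reassembled from the snippets quoted
in the thread; helper names such as is_obsolete_sp() and
kvm_mmu_prepare_zap_page() are taken from the series, and the real code has
additional handling (it restarts the list walk after each zap and skips pages
that are already invalid).]

#define BATCH_ZAP_PAGES	10

static void kvm_zap_obsolete_pages(struct kvm *kvm)
{
	struct kvm_mmu_page *sp, *node;
	LIST_HEAD(invalid_list);
	int batch = 0;

restart:
	list_for_each_entry_safe_reverse(sp, node,
	      &kvm->arch.active_mmu_pages, link) {
		/*
		 * active_mmu_pages is kept in allocation order, so once the
		 * reverse walk reaches a page that is not obsolete, all the
		 * remaining (newer) pages are not obsolete either.
		 */
		if (!is_obsolete_sp(kvm, sp))
			break;

		/*
		 * Only consider dropping mmu_lock after at least
		 * BATCH_ZAP_PAGES pages have been zapped, so each lock
		 * holder makes a minimum amount of forward progress.
		 * cond_resched_lock() returns 1 only when it really dropped
		 * and re-took the lock - one "lock-break" in the trace above.
		 */
		if (batch >= BATCH_ZAP_PAGES &&
		      cond_resched_lock(&kvm->mmu_lock)) {
			batch = 0;
			goto restart;
		}

		batch += kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
	}

	/* Flush TLBs and free the zapped pages once, after the walk. */
	kvm_mmu_commit_zap_page(kvm, &invalid_list);
}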
Yes, but this is not a real scenario, nor does it even resemble a real
scenario, as far as I know.

Are you sure this minimum-batching-before-considering-reschedule is still
needed even after the obsolete pages optimization? I fail to see why.
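[As a side note on what a "lock-break" actually is: cond_resched_lock() only
drops the lock when the scheduler wants the CPU back or another CPU is waiting
on the lock. Roughly, it behaves like the sketch below - a paraphrase of the
generic helper, not the exact kernel source; cond_resched_lock_sketch is an
illustrative name.]

static int cond_resched_lock_sketch(spinlock_t *lock)
{
	int need = need_resched();

	if (spin_needbreak(lock) || need) {
		spin_unlock(lock);
		if (need)
			schedule();	/* give the CPU away */
		else
			cpu_relax();	/* just let the lock waiter in */
		spin_lock(lock);
		return 1;		/* counted as one "lock-break" above */
	}
	return 0;
}

[So each "lock-break" in the dmesg output corresponds to a point where either
need_resched() or spin_needbreak() was true while batch >= 10, which is exactly
the (need_resched || spin_needbreak) case discussed above.]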