Do you have any way to determine which CPU groups the different VMs are
running on? If you end up in an overcommit situation where half the
'virtual' CPUs are on one AMD socket and the other half are on a different
AMD socket, then you'll be thrashing the HyperTransport link.

At Cray we were very careful never to overcommit runnable processes to
CPUs, and we generally locked each process to a single CPU.

Have a read of
http://berrange.com/posts/2010/02/12/controlling-guest-cpu-numa-affinity-in-libvirt-with-qemu-kvm-xen/

I'm going to speculate that when things don't work very well, you end up
with memory from a booting guest scattered across many different NUMA
nodes/CPUs. At that point it really won't matter how good the spin
loop/scheduler code is, because you are bound by the additional latency and
bandwidth limitations of running on one socket while half the memory you
access is resident on a different socket.

On Tue, Aug 21, 2012 at 04:21:07PM +0100, Richard Davies wrote:
> Avi Kivity wrote:
> > Richard Davies wrote:
> > > We're running host kernel 3.5.1 and qemu-kvm 1.1.1.
> > >
> > > I hadn't thought about it, but I agree this is related to cpu overcommit. The
> > > slow boots are intermittent (and infrequent) with cpu overcommit whereas I
> > > don't think it occurs without cpu overcommit.
> > >
> > > In addition, if there is a slow boot ongoing, and you kill some other VMs to
> > > reduce cpu overcommit then this will sometimes speed it up.
> > >
> > > I guess the question is why even with overcommit most boots are fine, but
> > > some small fraction then go slow?
> >
> > Could be a bug. The scheduler and the spin-loop handling code fight
> > each other instead of working well.
> >
> > Please provide snapshots of 'perf top' while a slow boot is in progress.
>
> Below are two 'perf top' snapshots during a slow boot, which appear to me to
> support your idea of a spin-lock problem.
>
> There are a lot more "unprocessable samples recorded" messages at the end of
> each snapshot which I haven't included. I think these may be from the guest
> OS - the kernel is listed, and qemu-kvm itself is listed on some other
> traces which I did, although not these.
>
> Richard.
>
>
>
>    PerfTop:   62249 irqs/sec  kernel:96.9%  exact:  0.0% [4000Hz cycles],  (all, 16 CPUs)
> --------------------------------------------------------------------------------------------------------------------------------
>
>     35.80%  [kernel]          [k] _raw_spin_lock_irqsave
>     21.64%  [kernel]          [k] isolate_freepages_block
>      5.91%  [kernel]          [k] yield_to
>      4.95%  [kernel]          [k] _raw_spin_lock
>      3.37%  [kernel]          [k] kvm_vcpu_on_spin
>      2.74%  [kernel]          [k] add_preempt_count
>      2.45%  [kernel]          [k] _raw_spin_unlock
>      2.33%  [kernel]          [k] sub_preempt_count
>      2.18%  [kernel]          [k] svm_vcpu_run
>      2.17%  [kernel]          [k] kvm_vcpu_yield_to
>      1.89%  [kernel]          [k] memcmp
>      1.50%  [kernel]          [k] get_pid_task
>      1.26%  [kernel]          [k] kvm_arch_vcpu_ioctl_run
>      1.16%  [kernel]          [k] pid_task
>      0.70%  [kernel]          [k] rcu_note_context_switch
>      0.70%  [kernel]          [k] trace_hardirqs_on
>      0.52%  [kernel]          [k] __rcu_read_unlock
>      0.51%  [kernel]          [k] trace_preempt_on
>      0.47%  [kernel]          [k] __srcu_read_lock
>      0.43%  [kernel]          [k] get_parent_ip
>      0.42%  [kernel]          [k] get_pageblock_flags_group
>      0.38%  [kernel]          [k] in_lock_functions
>      0.34%  [kernel]          [k] trace_preempt_off
>      0.34%  [kernel]          [k] trace_hardirqs_off
>      0.29%  [kernel]          [k] clear_page_c
>      0.23%  [kernel]          [k] __srcu_read_unlock
>      0.20%  [kernel]          [k] __rcu_read_lock
>      0.14%  [kernel]          [k] handle_exit
>      0.11%  libc-2.10.1.so    [.] strcmp
>      0.11%  [kernel]          [k] _raw_spin_unlock_irqrestore
>      0.11%  [kernel]          [k] _raw_spin_lock_irq
>      0.11%  [kernel]          [k] find_highest_vector
>      0.09%  [kernel]          [k] ktime_get
>      0.08%  [kernel]          [k] copy_page_c
>      0.08%  [kernel]          [k] pause_interception
>      0.08%  [kernel]          [k] kmem_cache_alloc
>      0.08%  [kernel]          [k] resched_task
>      0.08%  perf              [.] dso__find_symbol
>      0.06%  [kernel]          [k] compaction_alloc
>      0.06%  libc-2.10.1.so    [.] 0x0000000000076dab
>      0.06%  [kernel]          [k] read_tsc
>      0.06%  perf              [.] add_hist_entry
>      0.05%  [kernel]          [k] svm_read_l1_tsc
>      0.05%  [kernel]          [k] native_read_tsc
>      0.05%  perf              [.] sort__dso_cmp
>      0.05%  [kernel]          [k] copy_user_generic_string
>      0.05%  [kernel]          [k] ktime_get_update_offsets
>      0.04%  [kernel]          [k] kvm_check_async_pf_completion
>      0.04%  [kernel]          [k] __schedule
>      0.04%  [kernel]          [k] __rcu_pending
>      0.04%  [kernel]          [k] svm_complete_interrupts
>      0.04%  [kernel]          [k] perf_pmu_disable
>      0.04%  [kernel]          [k] isolate_migratepages_range
>      0.04%  [kernel]          [k] sched_clock_cpu
>      0.04%  [kernel]          [k] kvm_cpu_has_pending_timer
>      0.04%  [kernel]          [k] apic_timer_interrupt
>      0.04%  [vdso]            [.] 0x00007fff2e1ff607
>      0.04%  [kernel]          [k] apic_update_ppr
>      0.04%  [kernel]          [k] do_select
>      0.04%  [kernel]          [k] svm_scale_tsc
>      0.04%  [kernel]          [k] system_call_after_swapgs
>      0.03%  [kernel]          [k] kvm_lapic_get_cr8
>      0.03%  perf              [.] sort__sym_cmp
>      0.03%  [kernel]          [k] find_next_bit
>      0.03%  [kernel]          [k] kvm_set_cr8
>      0.03%  [kernel]          [k] rcu_check_callbacks
> 9750 unprocessable samples recorded.
> 9751 unprocessable samples recorded.
> 9752 unprocessable samples recorded.
> 9753 unprocessable samples recorded.
> 9754 unprocessable samples recorded.
> 9755 unprocessable samples recorded.
> 9756 unprocessable samples recorded.
> 9757 unprocessable samples recorded.
> 9758 unprocessable samples recorded.
> 9759 unprocessable samples recorded.
> 9760 unprocessable samples recorded.
> 9761 unprocessable samples recorded.
> 9762 unprocessable samples recorded.
> 9763 unprocessable samples recorded.
>
>
>
>    PerfTop:   61584 irqs/sec  kernel:97.4%  exact:  0.0% [4000Hz cycles],  (all, 16 CPUs)
> --------------------------------------------------------------------------------------------------------------------------------
>
>     36.73%  [kernel]          [k] _raw_spin_lock_irqsave
>     19.00%  [kernel]          [k] isolate_freepages_block
>      5.80%  [kernel]          [k] yield_to
>      5.23%  [kernel]          [k] _raw_spin_lock
>      3.97%  [kernel]          [k] kvm_vcpu_on_spin
>      2.98%  [kernel]          [k] add_preempt_count
>      2.45%  [kernel]          [k] sub_preempt_count
>      2.37%  [kernel]          [k] _raw_spin_unlock
>      2.22%  [kernel]          [k] svm_vcpu_run
>      2.19%  [kernel]          [k] kvm_vcpu_yield_to
>      1.90%  [kernel]          [k] memcmp
>      1.54%  [kernel]          [k] get_pid_task
>      1.39%  [kernel]          [k] kvm_arch_vcpu_ioctl_run
>      1.30%  [kernel]          [k] pid_task
>      0.75%  [kernel]          [k] rcu_note_context_switch
>      0.74%  [kernel]          [k] trace_hardirqs_on
>      0.58%  [kernel]          [k] __rcu_read_unlock
>      0.55%  [kernel]          [k] trace_preempt_on
>      0.47%  [kernel]          [k] __srcu_read_lock
>      0.44%  [kernel]          [k] get_parent_ip
>      0.41%  [kernel]          [k] clear_page_c
>      0.40%  [kernel]          [k] get_pageblock_flags_group
>      0.39%  [kernel]          [k] in_lock_functions
>      0.36%  [kernel]          [k] trace_preempt_off
>      0.35%  [kernel]          [k] trace_hardirqs_off
>      0.23%  [kernel]          [k] __srcu_read_unlock
>      0.20%  [kernel]          [k] __rcu_read_lock
>      0.15%  [kernel]          [k] _raw_spin_lock_irq
>      0.14%  [kernel]          [k] handle_exit
>      0.12%  [kernel]          [k] find_highest_vector
>      0.11%  [kernel]          [k] resched_task
>      0.10%  libc-2.10.1.so    [.] strcmp
>      0.09%  [kernel]          [k] _raw_spin_unlock_irqrestore
>      0.09%  [kernel]          [k] ktime_get
>      0.08%  [kernel]          [k] pause_interception
>      0.08%  [kernel]          [k] copy_page_c
>      0.07%  [kernel]          [k] __schedule
>      0.07%  [kernel]          [k] compact_zone
>      0.07%  perf              [.] dso__find_symbol
>      0.06%  perf              [.] add_hist_entry
>      0.06%  [kernel]          [k] read_tsc
>      0.06%  [kernel]          [k] svm_read_l1_tsc
>      0.05%  [kernel]          [k] native_read_tsc
>      0.05%  [kernel]          [k] ktime_get_update_offsets
>      0.05%  [kernel]          [k] compaction_alloc
>      0.05%  libc-2.10.1.so    [.] 0x0000000000073ae0
>      0.05%  [kernel]          [k] kmem_cache_alloc
>      0.05%  [kernel]          [k] svm_complete_interrupts
>      0.05%  [kernel]          [k] kvm_check_async_pf_completion
>      0.05%  [kernel]          [k] apic_timer_interrupt
>      0.05%  perf              [.] sort__dso_cmp
>      0.05%  [kernel]          [k] kvm_cpu_has_pending_timer
>      0.04%  [kernel]          [k] svm_scale_tsc
>      0.04%  [kernel]          [k] isolate_migratepages_range
>      0.04%  [kernel]          [k] sched_clock_cpu
>      0.04%  [kernel]          [k] __rcu_pending
>      0.04%  [kernel]          [k] apic_update_ppr
>      0.04%  [kernel]          [k] do_select
>      0.04%  [kernel]          [k] perf_pmu_disable
>      0.04%  [kernel]          [k] kvm_set_cr8
>      0.04%  [kernel]          [k] update_curr
>      0.04%  [kernel]          [k] reschedule_interrupt
>      0.03%  [kernel]          [k] kvm_lapic_get_cr8
>      0.03%  libc-2.10.1.so    [.] strstr
>      0.03%  [kernel]          [k] apic_has_pending_timer
>      0.03%  perf              [.] sort__sym_cmp
> 4963 unprocessable samples recorded.
> 4964 unprocessable samples recorded.
> 4965 unprocessable samples recorded.
> 4966 unprocessable samples recorded.
> 4967 unprocessable samples recorded.
> 4968 unprocessable samples recorded.
> 4969 unprocessable samples recorded.
> 4970 unprocessable samples recorded.
> 4971 unprocessable samples recorded.
> 4972 unprocessable samples recorded.
> 4973 unprocessable samples recorded.
> 4974 unprocessable samples recorded.
> 4975 unprocessable samples recorded.
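
On the question at the top of this reply (which CPU groups each VM can run on), here is a rough sketch of one way to check from the host. It is not taken from libvirt or qemu-kvm. It assumes a Linux host that exposes /sys/devices/system/node/node*/cpulist and /proc/<pid>/task, and that you already know the PID of the qemu-kvm process. The helper names (parse_cpulist, read_node_cpus, thread_affinity, nodes_for_cpus) are made up for illustration.

```python
import glob
import os
import sys


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (memory-only node) yields no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_node_cpus(sysfs="/sys/devices/system/node"):
    """Return {node_id: set of CPU ids} as reported by sysfs (assumed layout)."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. 'node1'
        with open(path) as f:
            nodes[int(node_dir[4:])] = parse_cpulist(f.read())
    return nodes


def thread_affinity(pid):
    """Return {tid: set of CPUs the thread is allowed to run on} for a process."""
    result = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            result[int(tid)] = set(os.sched_getaffinity(int(tid)))
        except ProcessLookupError:
            pass  # the thread exited while we were iterating
    return result


def nodes_for_cpus(allowed, node_cpus):
    """Return the sorted NUMA node ids that share at least one CPU with 'allowed'."""
    return sorted(n for n, cpus in node_cpus.items() if cpus & allowed)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 this_script.py <qemu-kvm pid>
    node_cpus = read_node_cpus()
    for tid, allowed in sorted(thread_affinity(int(sys.argv[1])).items()):
        print(tid, sorted(allowed), "nodes:", nodes_for_cpus(allowed, node_cpus))
```

If every thread of a guest reports more than one node, nothing is pinning it to a socket. Note that this only shows where threads are allowed to run, not where the guest's memory actually landed; /proc/<pid>/numa_maps is the place to look for that.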