On 05/03/2016 05:09 PM, Radim Krčmář wrote:
> 2016-05-03 14:37+0200, Christian Borntraeger:
>> Some wakeups should not be considered a successful poll. For example, on
>> s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
>> would be considered runnable - letting all vCPUs poll all the time for
>> transactional-like workloads, even if one vCPU would be enough.
>> This can result in huge CPU usage for large guests.
>> This patch lets architectures provide a way to qualify wakeups as
>> good or bad with regard to polls.
>>
>> For s390 the implementation will fence off halt polling for anything but
>> known good, single-vCPU events. The s390 implementation for floating
>> interrupts does a wakeup for one vCPU, but the interrupt will be delivered
>> by whatever CPU checks first for a pending interrupt. We prefer the
>> woken-up CPU by marking the poll of this CPU as a "good" poll.
>> This code will also mark several other wakeup reasons, like IPIs or
>> expired timers, as "good". This will of course also mark some events as
>> not successful. As KVM on z always runs as a 2nd-level hypervisor,
>> we prefer not to poll unless we are really sure, though.
>>
>> This patch successfully limits the CPU usage for cases like the uperf
>> 1-byte transactional ping-pong workload or wakeup-heavy workloads like
>> OLTP, while still providing a proper speedup.
>>
>> This also introduces a new vcpu stat, "halt_poll_no_tuning", that marks
>> wakeups that are considered not good for polling.
>>
>> Signed-off-by: Christian Borntraeger <borntraeger@xxxxxxxxxx>
>> Cc: David Matlack <dmatlack@xxxxxxxxxx>
>> Cc: Wanpeng Li <kernellwp@xxxxxxxxx>
>> ---
>
> Thanks for all explanations,
>
> Acked-by: Radim Krčmář <rkrcmar@xxxxxxxxxx>

The feedback about the logic triggered some more experiments on my side.
So I was experimenting with some different workloads/heuristics, and it seems
that even more aggressive shrinking (basically resetting to 0 as soon as an
invalid poll comes along) improves the CPU usage even more.

Patch on top:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ffe0545..c168662 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2036,12 +2036,13 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 out:
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
 
-	if (halt_poll_ns) {
+	if (!vcpu_valid_wakeup(vcpu))
+		shrink_halt_poll_ns(vcpu);
+	else if (halt_poll_ns) {
 		if (block_ns <= vcpu->halt_poll_ns)
 			;
 		/* we had a long block, shrink polling */
-		else if (!vcpu_valid_wakeup(vcpu) ||
-			 (vcpu->halt_poll_ns && block_ns > halt_poll_ns))
+		else if (vcpu->halt_poll_ns && block_ns > halt_poll_ns)
 			shrink_halt_poll_ns(vcpu);
 		/* we had a short halt and our poll time is too small */
 		else if (vcpu->halt_poll_ns < halt_poll_ns &&

The uperf 1byte:1byte workload seems to keep all the benefits. I have asked
the performance folks to test several other workloads to see if we lose some
of the benefits. So I will defer this patch until I have a full picture of
which heuristic is best. Hopefully I will have some answers next week.

(So the new diff looks like)

@@ -2034,7 +2036,9 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 out:
 	block_ns = ktime_to_ns(cur) - ktime_to_ns(start);
 
-	if (halt_poll_ns) {
+	if (!vcpu_valid_wakeup(vcpu))
+		shrink_halt_poll_ns(vcpu);
+	else if (halt_poll_ns) {
 		if (block_ns <= vcpu->halt_poll_ns)
 			;
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
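[Editor's note: to make the proposed ordering easier to follow outside the diff context, here is a minimal userspace sketch of the tuning logic. The struct, tunable values, and the `grow_halt_poll_ns` starting window are hypothetical stand-ins, not the kernel code; only the control flow in `tune_poll_window` mirrors the patch above: an invalid wakeup now shrinks unconditionally, before any of the block-time checks.]

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's module parameters. A shrink
 * divisor of 0 models the "reset to 0 immediately" heuristic. */
static unsigned int halt_poll_ns = 500000;   /* global poll-window cap */
static unsigned int halt_poll_ns_grow = 2;   /* grow multiplier */
static unsigned int halt_poll_ns_shrink = 0; /* 0 => reset window to 0 */

struct vcpu_sim {
	unsigned int halt_poll_ns; /* per-vCPU poll window, in ns */
};

static void shrink_halt_poll_ns(struct vcpu_sim *v)
{
	if (halt_poll_ns_shrink == 0)
		v->halt_poll_ns = 0;
	else
		v->halt_poll_ns /= halt_poll_ns_shrink;
}

static void grow_halt_poll_ns(struct vcpu_sim *v)
{
	unsigned int val = v->halt_poll_ns;

	if (val == 0)
		val = 10000; /* arbitrary initial window for this sketch */
	else
		val *= halt_poll_ns_grow;
	if (val > halt_poll_ns)
		val = halt_poll_ns;
	v->halt_poll_ns = val;
}

/* Mirrors the ordering proposed in the patch: an invalid wakeup shrinks
 * the window first, regardless of how long we actually blocked. */
static void tune_poll_window(struct vcpu_sim *v,
			     unsigned long long block_ns, int valid_wakeup)
{
	if (!valid_wakeup)
		shrink_halt_poll_ns(v);
	else if (halt_poll_ns) {
		if (block_ns <= v->halt_poll_ns)
			; /* the poll window covered the block: keep it */
		/* we had a long block, shrink polling */
		else if (v->halt_poll_ns && block_ns > halt_poll_ns)
			shrink_halt_poll_ns(v);
		/* we had a short halt and our poll time is too small */
		else if (v->halt_poll_ns < halt_poll_ns &&
			 block_ns < halt_poll_ns)
			grow_halt_poll_ns(v);
	}
}
```

With `halt_poll_ns_shrink = 0`, a single invalid wakeup drops the per-vCPU window straight to 0, and only subsequent valid short halts grow it back - the aggressive variant being benchmarked above.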