On 6/26/2013 6:40 AM, Raghavendra K T wrote:
On 06/26/2013 06:22 PM, Gleb Natapov wrote:
On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
On 06/25/2013 08:20 PM, Andrew Theurer wrote:
On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.
Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
  causing undercommit degradation (after PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler
V8 of PVspinlock was posted last year. After Avi's suggestions to look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.
Sorry for not posting this sooner. I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.
Hi Andrew,
Thanks for testing.
System: x3850X5, 40 cores, 80 threads
1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
----------------------------------------------------------
                        Total
Configuration           Throughput(MB/s)  Notes
3.10-default-ple_on     22945             5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off    23184             5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on    22895             5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off   23051             5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]
Yes. The 1x results look too close
2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
-----------------------------------------------------------
                        Total
Configuration           Throughput(MB/s)  Notes
3.10-default-ple_on     6287              55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off    1849              2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on    6691              50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off   16464             8% CPU in host kernel, 33% spin_lock in guests
I see a 6.426% improvement with ple_on and a 161.87% improvement with
ple_off (both relative to 3.10-default-ple_on). I think this is a very
good sign for the patches.
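For anyone wanting to re-derive the percentages, here is a minimal sketch. It assumes both figures are gains over the 3.10-default-ple_on throughput (6287) in the 2x over-commit, 10-vCPU table; that baseline is inferred from the arithmetic, and the helper name improvement_pct is made up for illustration.

```python
def improvement_pct(baseline, value):
    """Percentage gain of `value` over `baseline`."""
    return (value - baseline) / baseline * 100.0


# Throughput numbers from the 2x over-commit, 10-vCPU VM table.
default_ple_on = 6287      # assumed baseline for both comparisons
pvticket_ple_on = 6691
pvticket_ple_off = 16464

print(f"pvticket ple_on : {improvement_pct(default_ple_on, pvticket_ple_on):.3f}%")
print(f"pvticket ple_off: {improvement_pct(default_ple_on, pvticket_ple_off):.2f}%")
```

The same helper can be pointed at any other pair of rows in the tables.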
[PLE hinders pv-ticket improvements, but even with PLE off,
we are still off from ideal throughput (somewhere >20000)]
Okay. The ideal throughput you are referring to is at least around
80% of the 1x throughput for over-commit. Yes, we are still far away
from there.
1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
----------------------------------------------------------
                        Total
Configuration           Throughput(MB/s)  Notes
3.10-default-ple_on     22736             6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off    23377             5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on    22471             6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off   23445             5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]
I see ple_off is a little better here.
2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
----------------------------------------------------------
                        Total
Configuration           Throughput(MB/s)  Notes
3.10-default-ple_on     1965              70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off    226               2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on    1942              70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off   8003              11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
Still quite a bit off from ideal throughput]
This is again a remarkable improvement (307%).
This motivates me to add a patch to disable PLE when pvspinlock is on.
Probably we can add a hypercall that disables PLE in the kvm init patch.
The only problem I see is what happens if the guests are mixed
(i.e. one guest has pvspinlock support but the other does not, and the
host supports pv).
How about reintroducing the idea to create per-kvm ple_gap, ple_window
state? We were headed down that road when considering a dynamic window
at one point. Then you can just set a single guest's ple_gap to zero,
which would lead to PLE being disabled for that guest. We could also
revisit the dynamic window then.
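To make the per-kvm proposal concrete, here is a rough sketch of how it would handle the mixed-guest case. Everything in it is hypothetical: the VmPleState type, the configure_guest helper, and the default numbers are placeholders for illustration and are not taken from any patch in this thread. The only behaviour borrowed from the discussion is that a ple_gap of zero disables PLE for that guest.

```python
from dataclasses import dataclass

# Placeholder defaults for illustration only; not taken from the thread.
DEFAULT_PLE_GAP = 128
DEFAULT_PLE_WINDOW = 4096


@dataclass
class VmPleState:
    """Hypothetical per-VM PLE settings, replacing one host-wide pair."""
    ple_gap: int = DEFAULT_PLE_GAP
    ple_window: int = DEFAULT_PLE_WINDOW

    def ple_enabled(self) -> bool:
        # As proposed: a ple_gap of zero means PLE is off for this guest.
        return self.ple_gap != 0


def configure_guest(has_pvspinlock: bool) -> VmPleState:
    """Disable PLE only for guests that use pv spinlocks."""
    state = VmPleState()
    if has_pvspinlock:
        state.ple_gap = 0
    return state


# Mixed host: one pv-aware guest, one legacy guest.
pv_guest = configure_guest(has_pvspinlock=True)
legacy_guest = configure_guest(has_pvspinlock=False)
print(pv_guest.ple_enabled(), legacy_guest.ple_enabled())
```

Because the state lives with each VM, a guest without pvspinlock support keeps PLE while its pv-aware neighbour runs without it.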
Can be done, but let's understand why ple on is such a big problem. Is
it possible that ple gap and SPIN_THRESHOLD are not tuned properly?
The one obvious reason I see is commit awareness inside the guest.
For under-commit there is no necessity to do PLE, but unfortunately we
do. At least we return back immediately in case of potential
undercommits, but we still incur the vmexit delay.
The same applies to SPIN_THRESHOLD. SPIN_THRESHOLD should ideally be
higher for undercommit and lower for overcommit.
With this patch series SPIN_THRESHOLD is increased to 32k solely to
avoid under-commit regressions, but it would have eaten some amount of
overcommit performance.
In summary: excess halt-exit/pl-exit was one main reason for the
undercommit regression (compared to the pl disabled case).
I haven't yet tried these patches...hope to do so sometime soon.
Fwiw...after Raghu's last set of PLE changes that is now in 3.10-rc
kernels...I didn't notice much difference in workload performance
between PLE enabled vs. disabled. This is for under-commit (+pinned)
large guest case.
Here is a small sampling of the guest exits collected via kvm ftrace
for an OLTP-like workload which was keeping the guest ~85-90% busy
on an 8 socket Westmere-EX box (HT-off).
TIME_IN_GUEST        71.616293
TIME_ON_HOST          7.764597
MSR_READ              0.000362    0.0%
NMI_WINDOW            0.000002    0.0%
PAUSE_INSTRUCTION     0.158595    2.0%
PENDING_INTERRUPT     0.033779    0.4%
MSR_WRITE             0.001695    0.0%
EXTERNAL_INTERRUPT    3.210867   41.4%
IO_INSTRUCTION        0.000018    0.0%
RDPMC                 0.000067    0.0%
HLT                   2.822523   36.4%
EXCEPTION_NMI         0.008362    0.1%
CR_ACCESS             0.010027    0.1%
APIC_ACCESS           1.518300   19.6%
[ Don't mean to digress from the topic but in most of my
under-commit + pinned large guest experiments with 3.10 kernels
(using 2 or 3 different workloads) the time spent in halt exits is
typically much more than the time spent in ple exits. Can anything
be done to reduce the duration or avoid those exits? ]
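The percentage column in the exit sampling can be reproduced with the short sketch below. It assumes each percentage is that exit's share of TIME_ON_HOST; this is an inference from the numbers (the per-exit times add up to TIME_ON_HOST), not something stated explicitly. The helper name host_share_pct is made up for illustration, and only the four largest contributors are listed.

```python
TIME_ON_HOST = 7.764597

# Time per exit reason, copied from the sampling above (largest four only).
exit_times = {
    "PAUSE_INSTRUCTION": 0.158595,
    "EXTERNAL_INTERRUPT": 3.210867,
    "HLT": 2.822523,
    "APIC_ACCESS": 1.518300,
}


def host_share_pct(exit_time, time_on_host=TIME_ON_HOST):
    """Share of host time attributed to one exit reason, in percent."""
    return exit_time / time_on_host * 100.0


# Print the exits from most to least expensive.
for name, t in sorted(exit_times.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {host_share_pct(t):5.1f}%")

# Ratio of time in halt exits to time in PLE (PAUSE_INSTRUCTION) exits.
hlt_vs_ple = exit_times["HLT"] / exit_times["PAUSE_INSTRUCTION"]
print(f"HLT / PAUSE_INSTRUCTION time ratio: {hlt_vs_ple:.1f}")
```

The last line quantifies the point in the bracketed note: halt exits cost far more host time than PLE exits in this sample.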
1. Dynamic ple window was one solution for PLE, which we can
experiment with further (at VM level or global).
Is this the case where the dynamic PLE window starts off at a value
more suitable to reduce exits for under-commit (and pinned) cases,
and only when the host OS detects that the degree of under-commit is
shrinking (i.e. moving towards having more vcpus to schedule and
hence getting to be over committed) does it adjust the ple window to
suit the over-commit case? Or is this some different idea?
Thanks
Vinod
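One way to read the dynamic window question is as the policy sketched below. This is purely a hypothetical illustration: the function name, the halving/doubling steps, the bounds, and the use of "runnable vcpus versus host cpus" as the over-commit signal are all assumptions, not the design of any posted patch.

```python
def next_ple_window(current, runnable_vcpus, host_cpus,
                    min_window=4096, max_window=65536):
    """Hypothetical policy: wide window when under-committed, narrow otherwise.

    A larger window lets a vcpu spin longer before a PLE exit, which
    avoids needless exits when every vcpu has a physical cpu to run on.
    """
    if runnable_vcpus > host_cpus:
        # Over-commit detected: shrink so spinning vcpus yield sooner.
        return max(min_window, current // 2)
    # Under-commit: grow back towards the exit-avoiding value.
    return min(max_window, current * 2)


# Start at the under-commit-friendly value, then let load increase.
window = 65536
for runnable in (40, 80, 120, 160, 160):
    window = next_ple_window(window, runnable, host_cpus=80)
    print(runnable, window)
```

The window only starts shrinking once the runnable vcpu count exceeds the host cpu count, which matches the "starts off suited to under-commit" reading of the question.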
The other experiment I was thinking of is to extend spinlock to
accommodate vcpuid (Linus has opposed that, but it may be worth a try).
2. Andrew Theurer had a patch to reduce double runq lock
that I will be testing.
I have some older experiments to retry, though they did not give
significant improvements before the PLE handler was modified.
Andrew, do you have any other details to add (from the perf report
that you usually take with these experiments)?