Re: [PATCH 0/2] KVM: PPC: Book3S HV: Dynamic micro-threading/split-core

On Wed, Jun 17, 2015 at 07:30:09PM +0200, Laurent Vivier wrote:
> 
> Tested-by: Laurent Vivier <lvivier@xxxxxxxxxx>
> 
> Performance is better, but Paul, could you explain why it is even better when I disable dynamic micro-threading?
> Did I miss something?
> 
> My test system is an IBM Power S822L.
> 
> I run two guests with 8 vCPUs each (-smp 8,sockets=8,cores=1,threads=1),
> both pinned to the same core (using virt-manager's pinning option). Then I
> measure the time needed to compile a kernel in parallel in both guests
> with "make -j 16".
> 
> My kernel without micro-threading:
> 
> real    37m23.424s                 real    37m24.959s
> user    167m31.474s                user    165m44.142s
> sys     113m26.195s                sys     113m45.072s
> 
> With micro-threading patches (PATCH 1+2):
> 
> target_smt_mode 0 [in fact it was 8 here, but it should behave like 0, as it is > max threads/sub-core]
> dynamic_mt_modes 6
> 
> real    32m13.338s                 real    32m26.652s
> user    139m21.181s                user    140m20.994s
> sys     77m35.339s                 sys     78m16.599s
> 
> It's better, but if I disable dynamic micro-threading (still with PATCH 1+2 applied):
> 
> target_smt_mode 0
> dynamic_mt_modes 0
> 
> real    30m49.100s                 real    30m48.161s
> user    144m22.989s                user    142m53.886s
> sys     65m4.942s                  sys     66m8.159s
> 
> it's even better.

I think what's happening here is that with dynamic_mt_modes=0 the
system alternates between the two guests, whereas with
dynamic_mt_modes=6 it will spend some of the time running both guests
simultaneously in two-way split mode.  Since you have two
compute-bound guests that each have threads=1 and 8 vcpus, it can fill
up the core either way.  In that case it is more efficient to fill up
the core with vcpus from one guest and not split the core at all:
firstly because you avoid the split/unsplit transition latency, and
secondly because the threads run a little faster in whole-core mode
than in split-core mode.
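
For reference, dynamic_mt_modes is a bitmap of the split modes the
code may consider: 2 permits two-way splits, 4 permits four-way
splits, so 6 (the default) permits both and 0 disables dynamic
splitting entirely.  The gating check amounts to something like the
sketch below; the names here are illustrative, not the actual
book3s_hv.c code.

static int dynamic_mt_modes = 6;	/* kvm_hv module parameter */

/* n_subcores is 2 for a two-way split, 4 for a four-way split. */
static int split_mode_allowed(int n_subcores)
{
	return (dynamic_mt_modes & n_subcores) != 0;
}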

I am considering adding an additional heuristic: do two passes through
the list of preempted vcores, considering only vcores from the same
guest as the primary vcore on the first pass, and then considering all
vcores on the second pass.  We could then also decide that if the
first pass has already collected 4 or more runnable vcpus, we don't
bother with the second pass (see the sketch below).
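
In outline, that heuristic could look like the following.  All the
identifiers here are hypothetical, simplified stand-ins; the real code
would walk the kernel's preempted-vcore lists and also apply its
capacity checks for whichever split mode is chosen.

#include <stddef.h>

struct vcore {
	int guest_id;		/* which guest owns this vcore */
	int num_threads;	/* runnable vcpus in this vcore */
};

/* Collect preempted vcores onto the core running @primary. */
static int collect_vcores(const struct vcore *primary,
			  struct vcore **preempted, size_t n)
{
	int collected = primary->num_threads;
	size_t i;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		for (i = 0; i < n; i++) {
			struct vcore *vc = preempted[i];

			if (!vc)
				continue;
			/* Pass 0: only vcores from the primary's guest. */
			if (pass == 0 && vc->guest_id != primary->guest_id)
				continue;
			collected += vc->num_threads;
			preempted[i] = NULL;	/* claimed for this core */
		}
		/*
		 * If the same-guest pass already gathered 4 or more
		 * runnable vcpus, skip the cross-guest pass and stay
		 * in whole-core (unsplit) mode.
		 */
		if (collected >= 4)
			break;
	}
	return collected;
}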

Paul.