RE: [PATCH] x86/hyperv: Pass on the lpj value from host to guest

"Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> · Fri, 17 Feb 2023 02:34:21 +0000

From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, February 16, 2023 11:41 AM
> 
> On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> > From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx>
> > >
> > > And have it preset.
> > > This change significantly reduces the time needed to bring up the
> > > guest SMP configuration and makes sure the guest won't get inaccurate
> > > calibration results due to a "noisy neighbour" situation.
> > >
> > > Below are the numbers for a 16 VCPU guest before the patch (~1300 msec):
> > >
> > > [    0.562938] x86: Booting SMP configuration:
> > > ...
> > > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > >
> > > and after the patch (~130 msec):
> > >
> > > [    0.445079] x86: Booting SMP configuration:
> > > ...
> > > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > >
> > > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > > paravirt function to calculate cpu khz").
> >
> > This patch has been nagging at me a bit, and I finally did some further
> > checking.   Looking at Linux guests on local Hyper-V and in Azure, I see
> > a dmesg output line like this during boot:
> >
> > Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> >
> > We're already skipping the delay loop calculation because lpj_fine
> > is set in tsc_init(), using the results of get_loops_per_jiffy().  The
> > latter does exactly the same calculation as hv_preset_lpj() in
> > this patch.
> >
> > Is this patch arising from an environment where tsc_init() is
> > skipped for some reason?  Just trying to make sure we fully
> > understand when this patch is applicable, and when not.
> >
> 
> The problem here is a bit different: "lpj_fine" is considered only for
> the boot CPU (from init/calibrate.c):
> 
>         } else if ((!printed) && lpj_fine) {
>                 lpj = lpj_fine;
>                 pr_info("Calibrating delay loop (skipped), "
>                         "value calculated using timer frequency.. ");
> 
> while all the secondary ones use the timer to calibrate.
> 
> With this change preset_lpj will be used for all cores (from
> init/calibrate.c):
> 
>         } else if (preset_lpj) {
>                 lpj = preset_lpj;
>                 if (!printed)
>                         pr_info("Calibrating delay loop (skipped) "
>                                 "preset value.. ");
> 
> This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
> cpu_khz for loops_per_jiffy calculation"), where the commit message
> states the following:
> 
>     We do this only for the boot processor because the AP's can have
>     different base frequencies or the BIOS might boot a AP at a different
>     frequency.
> 
> Hope this helps.
> 

Indeed, you are right about lpj_fine being applied only to the boot
CPU.  So I've looked a little closer because I don't see the 1300
milliseconds you see for a 16 vCPU guest.

I've been experimenting with a 32 vCPU guest, and without your
patch, it takes only 26 milliseconds to get all 32 vCPUs started.  I
think the trick is in the call to calibrate_delay_is_known().  This
function copies the lpj value from a CPU in the same NUMA node
that has already been calibrated, assuming that constant_tsc is
set, which is the case in my test VM.  So the boot CPU sets lpj
based on lpj_fine, and all other CPUs effectively copy the value
from the boot CPU without doing calibration.
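
To make that order of checks concrete, here is a rough user-space
sketch.  It is a toy model, not the kernel code: struct cpu_model,
model_bringup() and the timed_lpj parameter are names made up for
illustration, the "!printed" test is approximated as "this is CPU 0",
and the same-node lookup follows the description above rather than
the exact implementation of calibrate_delay_is_known().

```c
#include <stdbool.h>
#include <stddef.h>

/* Where a CPU's loops-per-jiffy value came from in this toy model. */
enum lpj_source { LPJ_PRESET, LPJ_FINE, LPJ_COPIED, LPJ_TIMED };

struct cpu_model {
	int node;               /* NUMA node of this vCPU */
	unsigned long lpj;      /* 0 until the vCPU has been calibrated */
	enum lpj_source source; /* filled in by model_bringup() */
};

/*
 * Bring up CPUs 0..ncpus-1 in order (CPU 0 is the boot CPU) and return
 * how many of them fell through to the slow timer-based calibration.
 * timed_lpj stands in for whatever a timed calibration would measure.
 */
static int model_bringup(struct cpu_model *cpus, size_t ncpus,
			 unsigned long preset_lpj, unsigned long lpj_fine,
			 bool constant_tsc, unsigned long timed_lpj)
{
	int timed = 0;

	for (size_t i = 0; i < ncpus; i++) {
		unsigned long known = 0;

		/* Assumed model of calibrate_delay_is_known(): reuse the
		 * value of an already calibrated CPU on the same node. */
		if (constant_tsc) {
			for (size_t j = 0; j < i; j++) {
				if (cpus[j].node == cpus[i].node && cpus[j].lpj) {
					known = cpus[j].lpj;
					break;
				}
			}
		}

		if (preset_lpj) {                 /* applies to every CPU */
			cpus[i].lpj = preset_lpj;
			cpus[i].source = LPJ_PRESET;
		} else if (i == 0 && lpj_fine) {  /* boot CPU only */
			cpus[i].lpj = lpj_fine;
			cpus[i].source = LPJ_FINE;
		} else if (known) {               /* copied, no calibration */
			cpus[i].lpj = known;
			cpus[i].source = LPJ_COPIED;
		} else {                          /* slow path */
			cpus[i].lpj = timed_lpj;
			cpus[i].source = LPJ_TIMED;
			timed++;
		}
	}
	return timed;
}
```

In this model, 32 CPUs on a single node with lpj_fine set and
constant_tsc true give zero timed calibrations.  Spreading them over
4 nodes gives three, because node 0 is seeded by the boot CPU.  Any
non-zero preset_lpj gives zero regardless of the node layout.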

I also experimented with multiple NUMA nodes.  In that case, it
does take longer.  Dividing the 32 vCPUs into 4 NUMA nodes,
it takes about 210 milliseconds to boot all 32 vCPUs.  Presumably the
extra time is due to timer-based calibration being done once for each
NUMA node, plus probably some misc NUMA accounting overhead.
With preset_lpj set, that 210 milliseconds drops to 32 milliseconds,
which is more like the case with only 1 NUMA node, so there's some
modest benefit with multiple NUMA nodes.

Could you check if constant_tsc is set in your test environment?  It
really should be set in a Hyper-V VM.
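
One way to check from inside the guest is to look for the
"constant_tsc" word on the "flags" line of /proc/cpuinfo.  A minimal
sketch of such a check follows.  The helper names line_has_flag() and
cpuinfo_has_flag() are made up for illustration, and it assumes the
flags line fits in the 8 KiB buffer.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated word. */
static bool line_has_flag(const char *line, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = line;

	if (len == 0)
		return false;

	while ((p = strstr(p, flag)) != NULL) {
		bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p++;	/* partial match such as "nonstop_tsc"; keep looking */
	}
	return false;
}

/*
 * Scan a cpuinfo-style file for `flag` on a line starting with "flags".
 * Returns 1 if found, 0 if not found, -1 if the file cannot be opened.
 */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
	char line[8192];
	int found = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "flags", 5) == 0 &&
		    line_has_flag(line, flag)) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

Calling cpuinfo_has_flag("/proc/cpuinfo", "constant_tsc") should
return 1 in a guest where the flag is exposed.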

Michael