Re: [PATCH] x86/hyperv: Pass on the lpj value from host to guest

On Fri, Feb 17, 2023 at 02:34:21AM +0000, Michael Kelley (LINUX) wrote:
> From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, February 16, 2023 11:41 AM
> > 
> > On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> > > From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx>
> > > >
> > > > And have it preset.
> > > > This change significantly reduces the time needed to bring up the guest
> > > > SMP configuration and makes sure the guest won't get inaccurate
> > > > calibration results due to a "noisy neighbour" situation.
> > > >
> > > > Below are the numbers for a 16 VCPU guest before the patch (~1300 msec):
> > > >
> > > > [    0.562938] x86: Booting SMP configuration:
> > > > ...
> > > > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > > >
> > > > and after the patch (~130 msec):
> > > >
> > > > [    0.445079] x86: Booting SMP configuration:
> > > > ...
> > > > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > > >
> > > > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > > > paravirt function to calculate cpu khz").
> > >
> > > This patch has been nagging at me a bit, and I finally did some further
> > > checking.   Looking at Linux guests on local Hyper-V and in Azure, I see
> > > a dmesg output line like this during boot:
> > >
> > > Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> > >
> > > We're already skipping the delay loop calculation because lpj_fine
> > > is set in tsc_init(), using the results of get_loops_per_jiffy().  The
> > > latter does exactly the same calculation as hv_preset_lpj() in
> > > this patch.
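
(For reference, the calculation in question is lpj = tsc_khz * 1000 / HZ.
With HZ=1000, the lpj=2593905 above corresponds to a ~2.59 GHz TSC, and the
reported BogoMIPS is lpj * HZ / 500000 = 5187.81, so that dmesg line is
consistent with the timer-frequency-derived value.)
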
> > >
> > > Is this patch arising from an environment where tsc_init() is
> > > skipped for some reason?  Just trying to make sure we fully
> > > understand when this patch is applicable, and when not.
> > >
> > 
> > The problem here is a bit different: "lpj_fine" is considered only for
> > the boot CPU (from init/calibrate.c):
> > 
> >         } else if ((!printed) && lpj_fine) {
> >                 lpj = lpj_fine;
> >                 pr_info("Calibrating delay loop (skipped), "
> >                         "value calculated using timer frequency.. ");
> > 
> > while all the secondary ones use the timer to calibrate.
> > 
> > With this change preset_lpj will be used for all cores (from
> > init/calibrate.c):
> > 
> >         } else if (preset_lpj) {
> >                 lpj = preset_lpj;
> >                 if (!printed)
> >                         pr_info("Calibrating delay loop (skipped) "
> >                                 "preset value.. ");
> > 
> > This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
> > cpu_khz for loops_per_jiffy calculation"), where the commit message
> > states the following:
> > 
> >     We do this only for the boot processor because the AP's can have
> >     different base frequencies or the BIOS might boot a AP at a different
> >     frequency.
> > 
> > Hope this helps.
> > 
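
To make the precedence concrete, the relevant chain in calibrate_delay()
(init/calibrate.c) looks roughly like this -- a simplified sketch, with
the calibrate_delay_direct() branch left out:

        if (per_cpu(cpu_loops_per_jiffy, this_cpu))
                /* this CPU was calibrated earlier (e.g. CPU hotplug) */
                lpj = per_cpu(cpu_loops_per_jiffy, this_cpu);
        else if (preset_lpj)
                /* "lpj=" cmdline or a paravirt preset: used on every CPU */
                lpj = preset_lpj;
        else if (!printed && lpj_fine)
                /* value from tsc_init(): boot CPU only */
                lpj = lpj_fine;
        else if ((lpj = calibrate_delay_is_known()))
                /* reuse a value from an already calibrated CPU */
                ;
        else
                /* timer-based measurement */
                lpj = calibrate_delay_converge();
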
> 
> Indeed, you are right about lpj_fine being applied only to the boot
> CPU.  So I've looked a little closer because I don't see the 1300
> milliseconds you see for a 16 vCPU guest.
> 
> I've been experimenting with a 32 vCPU guest, and without your
> patch, it takes only 26 milliseconds to get all 32 vCPUs started.  I
> think the trick is in the call to calibrate_delay_is_known().  This
> function copies the lpj value from a CPU in the same NUMA node
> that has already been calibrated, assuming that constant_tsc is
> set, which is the case in my test VM.  So the boot CPU sets lpj
> based on lpj_fine, and all other CPUs effectively copy the value
> from the boot CPU without doing calibration.
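
(For anyone following the thread: calibrate_delay_is_known() in
arch/x86/kernel/tsc.c does roughly the following -- a simplified sketch,
not the exact code:

        if (!cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
                return 0;

        /* reuse the lpj of an already calibrated CPU in the same node */
        for_each_online_cpu(sibling)
                if (sibling != cpu &&
                    cpu_to_node(sibling) == cpu_to_node(cpu) &&
                    cpu_data(sibling).loops_per_jiffy)
                        return cpu_data(sibling).loops_per_jiffy;

        return 0;

so with constant_tsc only the first CPU brought up in each node pays for a
timer-based calibration.)
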
> 
> I also experimented with multiple NUMA nodes.  In that case, it
> does take longer.  Dividing the 32 vCPUs into 4 NUMA nodes,
> it takes about 210 milliseconds to boot all 32 vCPUs.  Presumably the
> extra time is due to timer-based calibration being done once for each
> NUMA node, plus probably some misc NUMA accounting overhead.
> With preset_lpj set, that 210 milliseconds drops to 32 milliseconds,
> which is more like the case with only 1 NUMA node, so there's some
> modest benefit with multiple NUMA nodes.
> 
> Could you check if constant_tsc is set in your test environment?  It
> really should be set in a Hyper-V VM.
> 

I guess I should have mentioned that the results presented in the
commit message are from an L2 guest, where there are no NUMA nodes, so
every core is calibrated individually and boot time grows linearly with
the number of cores assigned.
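
For context, the change follows the same pattern as kvm_get_preset_lpj()
in the KVM commit referenced in the changelog: compute lpj from the known
TSC frequency and stash it in preset_lpj before the secondary CPUs come
up, so every CPU takes the preset_lpj branch and skips calibration.
Roughly (a sketch, not the exact hunk):

        static void __init hv_preset_lpj(void)
        {
                u64 lpj = (u64)tsc_khz * 1000;  /* TSC frequency in Hz */

                do_div(lpj, HZ);                /* cycles per tick */
                preset_lpj = lpj;
        }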

I'm not sure, though, whether NUMA emulation would be the right choice
here, or whether this boot time penalty should be left as is, since we
can't guarantee all the processors are in the same NUMA node and thus
their lpj values have to be measured.

What do you think, Michael?

Thanks,
Stanislav

> Michael


