From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, February 16, 2023 11:41 AM
>
> On Tue, Feb 14, 2023 at 04:19:13PM +0000, Michael Kelley (LINUX) wrote:
> > From: Stanislav Kinsburskii <skinsburskii@xxxxxxxxxxxxxxxxxxx>
> > >
> > > And have it preset.
> > > This change significantly reduces the time to bring up the guest SMP
> > > configuration and makes sure the guest won't get inaccurate
> > > calibration results due to a "noisy neighbour" situation.
> > >
> > > Below are the numbers for a 16 VCPU guest before the patch (~1300 msec)
> > >
> > > [    0.562938] x86: Booting SMP configuration:
> > > ...
> > > [    1.859447] smp: Brought up 1 node, 16 CPUs
> > >
> > > and after the patch (~130 msec):
> > >
> > > [    0.445079] x86: Booting SMP configuration:
> > > ...
> > > [    0.575035] smp: Brought up 1 node, 16 CPUs
> > >
> > > This change is inspired by commit 0293615f3fb9 ("x86: KVM guest: use
> > > paravirt function to calculate cpu khz").
> >
> > This patch has been nagging at me a bit, and I finally did some further
> > checking. Looking at Linux guests on local Hyper-V and in Azure, I see
> > a dmesg output line like this during boot:
> >
> > Calibrating delay loop (skipped), value calculated using timer frequency.. 5187.81 BogoMIPS (lpj=2593905)
> >
> > We're already skipping the delay loop calculation because lpj_fine
> > is set in tsc_init(), using the results of get_loops_per_jiffy(). The
> > latter does exactly the same calculation as hv_preset_lpj() in
> > this patch.
> >
> > Is this patch arising from an environment where tsc_init() is
> > skipped for some reason? Just trying to make sure we fully
> > understand when this patch is applicable, and when not.
> >
>
> The problem here is a bit different: "lpj_fine" is considered only for
> the boot CPU (from init/calibrate.c):
>
>         } else if ((!printed) && lpj_fine) {
>                 lpj = lpj_fine;
>                 pr_info("Calibrating delay loop (skipped), "
>                         "value calculated using timer frequency.. ");
>
> while all the secondary ones use the timer to calibrate.
>
> With this change preset_lpj will be used for all cores (from
> init/calibrate.c):
>
>         } else if (preset_lpj) {
>                 lpj = preset_lpj;
>                 if (!printed)
>                         pr_info("Calibrating delay loop (skipped) "
>                                 "preset value.. ");
>
> This logic with lpj_fine comes from commit 3da757daf86e ("x86: use
> cpu_khz for loops_per_jiffy calculation"), where the commit message
> states the following:
>
>     We do this only for the boot processor because the AP's can have
>     different base frequencies or the BIOS might boot a AP at a different
>     frequency.
>
> Hope this helps.
>

Indeed, you are right about lpj_fine being applied only to the boot
CPU. So I've looked a little closer because I don't see the 1300
milliseconds you see for a 16 vCPU guest. I've been experimenting with
a 32 vCPU guest, and without your patch, it takes only 26 milliseconds
to get all 32 vCPUs started. I think the trick is in the call to
calibrate_delay_is_known(). This function copies the lpj value from a
CPU in the same NUMA node that has already been calibrated, assuming
that constant_tsc is set, which is the case in my test VM. So the boot
CPU sets lpj based on lpj_fine, and all other CPUs effectively copy the
value from the boot CPU without doing calibration.

I also experimented with multiple NUMA nodes. In that case, it does
take longer. Dividing the 32 vCPUs into 4 NUMA nodes, it takes about
210 milliseconds to boot all 32 vCPUs. Presumably the extra time is due
to timer-based calibration being done once for each NUMA node, plus
probably some misc NUMA accounting overhead. With preset_lpj set, that
210 milliseconds drops to 32 milliseconds, which is more like the case
with only 1 NUMA node, so there's some modest benefit with multiple
NUMA nodes.

Could you check if constant_tsc is set in your test environment? It
really should be set in a Hyper-V VM.

Michael