Re: [External] Re: [PATCH v9 0/8] Parallel CPU bringup for x86_64

On 23/02/2023 11:07, David Woodhouse wrote:
On Wed, 2023-02-22 at 17:42 +0100, Thomas Gleixner wrote:
David!

On Wed, Feb 22 2023 at 10:11, David Woodhouse wrote:
On Wed, 2023-02-15 at 14:54 +0000, Usama Arif wrote:
So the next thing that might be worth looking at is allowing the APs
all to be running their hotplug thread simultaneously, bringing
themselves from CPUHP_BRINGUP_CPU to CPUHP_AP_ONLINE. This series eats
the initial INIT/SIPI/SIPI latency, but if there's any significant time
in the AP hotplug thread, that could be worth parallelising.
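
For illustration, the rough shape of that split is something like the
sketch below. bringup_kick_ap() and bringup_wait_ap() are hypothetical
helper names, not the actual code from the series:

    void __init bringup_nonboot_cpus_parallel(unsigned int max_cpus)
    {
            unsigned int cpu;

            /* Phase 1: fire INIT/SIPI/SIPI at every AP without waiting
             * for any of them to come up. */
            for_each_present_cpu(cpu) {
                    if (num_online_cpus() >= max_cpus)
                            break;
                    if (!cpu_online(cpu))
                            bringup_kick_ap(cpu);           /* hypothetical */
            }

            /* Phase 2: wait for each AP in turn. The per-CPU hotplug
             * thread work from CPUHP_BRINGUP_CPU to CPUHP_AP_ONLINE is
             * still serialized here. */
            for_each_present_cpu(cpu) {
                    if (!cpu_online(cpu))
                            bringup_wait_ap(cpu);           /* hypothetical */
            }
    }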

On a 112 CPU machine (56 cores, HT enabled) the bringup takes

Setup and SIPIs sent:    49 ms
Bringup each CPU:       516 ms

That's about 500 ms faster than a non-parallel bringup!

Now looking at the 516 ms, which is ~4.7 ms/CPU. The vast majority of the
time is spent on the APs in

      cpu_init() -> ucode_cpu_init()

for the primary threads of each core. The secondary threads get out of
ucode_cpu_init() quickly (~1us) because the primary thread has already
loaded the microcode.

A microcode load on that machine takes ~7.5 ms per primary thread on
average which sums up to 7.5 * 55 = 412.5 ms

The threaded bringup after CPUHP_AP_ONLINE takes about 100us per CPU.
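
That asymmetry is what you'd expect from the AP-side loader: only the
first thread of each core pays for the actual update, and its sibling
finds the revision already current and bails out early. Very roughly (a
simplified sketch, not the actual arch/x86/kernel/cpu/microcode code;
the revision helpers are made-up names):

    static void ucode_cpu_init_sketch(void)
    {
            /* Sibling path: the primary thread of this core already
             * loaded the image, so the revision check passes and we
             * are out in ~1us. */
            if (current_ucode_rev() >= latest_ucode_rev())
                    return;

            /* Primary-thread path: the WRMSR-triggered update itself,
             * ~7.5ms per core on this machine. */
            wrmsrl(MSR_IA32_UCODE_WRITE, ucode_image_addr);
    }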

Nice analysis; thanks!

identify_secondary_cpu() is one of the longer functions; it takes
~125us per CPU, which sums up to ~13ms.

Hm, shouldn't that one already be parallelised by my 'part 2' patch?

It's called from smp_store_cpu_info(), from smp_callin(), which is
called from somewhere in the middle of start_secondary().

And if the comments I helpfully added to that function for the benefit
of our future selves are telling the truth, the AP is free to get that
far once the BSP has set its bit in cpu_callout_mask, which happens in
do_wait_cpu_initialized().
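
Sketched out, that ordering looks like this (much simplified from the
real smp_callin()/do_wait_cpu_initialized() pair):

    /* AP side, early in start_secondary(): spin until the BSP releases
     * us by setting our bit in cpu_callout_mask... */
    static void smp_callin_sketch(int cpuid)
    {
            while (!cpumask_test_cpu(cpuid, cpu_callout_mask))
                    cpu_relax();

            /* ...and only then do the expensive per-CPU identification. */
            smp_store_cpu_info(cpuid);      /* -> identify_secondary_cpu() */
    }

So once the BSP has set all the callout bits, nothing on the AP side
should stop identify_secondary_cpu() from running on all APs
concurrently.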

So
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/4b5731e05b0#patch3
ought to parallelise that. But Usama empirically reported that 'part 2'
didn't add any noticeable benefit, not even those 13ms? On a *larger*
machine.


So I am using a machine similar to Thomas's: 128 CPUs (64 cores, HT
enabled). I have the microcode config disabled, so I guess I get
numbers similar to Thomas's, i.e. roughly 100ms (516 - 412). I do see a
difference of ~3ms with part 2, which I thought might be within the
margin of error of the measurement, but I guess it isn't. After seeing
the ~70ms saved by reusing timer calibration, I didn't really focus
much on part 2. I guess that ~70ms is part of the "rest" in Thomas's
table below?

Thanks,
Usama


The TSC sync check for the first CPU on the second socket consumes
20ms. That's only once per socket; intra-socket sync uses
MSR_TSC_ADJUST, which is more or less free.
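
Sketch of the difference, much simplified from
arch/x86/kernel/tsc_sync.c ('expected_adjust' is illustrative):

    u64 adj;

    /* Intra-socket: one MSR read and a compare, essentially free. */
    rdmsrl(MSR_IA32_TSC_ADJUST, adj);
    if (adj != expected_adjust)
            wrmsrl(MSR_IA32_TSC_ADJUST, expected_adjust);

    /* First CPU on a new socket: there is no trusted reference value,
     * so the kernel falls back to a measured warp test against an
     * already-online CPU, which is where the ~20ms per socket goes. */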

So the 516 ms are wasted here:

    total                                516 ms
    ucode_cpu_init()                     412 ms
    identify_secondary_cpu()              13 ms
    2ndsocket_tsc_sync                    20 ms
    threaded bringup                      12 ms
    rest                                  59 ms

So the rest is about 530us per CPU, which is just the sum of many small
functions, lock contention, etc.

Getting rid of the microcode overhead is possible. There is no reason
to serialize it between the cores. But it needs serialization between
HT siblings, which requires moving identify_secondary_cpu() and its
caller smp_store_cpu_info() ahead of the synchronization point and then
serializing only between the siblings. That's going to be major surgery
and inspection effort to ensure that there are no hidden assumptions
about global hotplug serialization.
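
One conceivable shape for that sibling-only serialization, purely as a
sketch (the per-core completions and helper names are hypothetical;
nothing like this exists in the tree today):

    static struct completion ucode_done[NR_CORES_MAX];  /* init omitted */

    static void ucode_cpu_init_percore(unsigned int cpu)
    {
            unsigned int core = topology_core_id(cpu);

            if (cpu == cpumask_first(topology_sibling_cpumask(cpu))) {
                    load_microcode_on_primary();        /* hypothetical */
                    complete_all(&ucode_done[core]);
            } else {
                    /* A sibling must not execute while its primary
                     * thread is mid-update, but it has no reason to
                     * wait for any other core. */
                    wait_for_completion(&ucode_done[core]);
            }
    }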

So that would cut the total cost down to ~100ms plus the
preparatory/SIPI stage of 60ms which sums up to about 160ms and about
1.5ms per CPU total.

Further optimization starts to be questionable IMO. It's surely possible
somehow, but then you really have to go and inspect each and every
function in those code paths, add local locking, etc. Not to mention
the required mess in the core code to support that.

The low-hanging fruit that buys the most is the identification/topology
muck and the microcode loading. That needs to be addressed first anyway.

Agreed, thanks.



