David!

On Wed, Feb 22 2023 at 10:11, David Woodhouse wrote:
> On Wed, 2023-02-15 at 14:54 +0000, Usama Arif wrote:
> So the next thing that might be worth looking at is allowing the APs
> all to be running their hotplug thread simultaneously, bringing
> themselves from CPUHP_BRINGUP_CPU to CPUHP_AP_ONLINE. This series eats
> the initial INIT/SIPI/SIPI latency, but if there's any significant time
> in the AP hotplug thread, that could be worth parallelising.

On a 112 CPU machine (56 cores, HT enabled) the bringup takes

   Setup and SIPIs sent:      49 ms
   Bringup each CPU:         516 ms

That's about 500 ms faster than a non-parallel bringup!

Now looking at the 516 ms, which is ~4.7 ms/CPU.

The vast majority of the time is spent on the APs in

   cpu_init() -> ucode_cpu_init()

for the primary threads of each core. The secondary threads are quickly
(1us) out of ucode_cpu_init() because the primary thread already loaded
the microcode.

A microcode load on that machine takes ~7.5 ms per primary thread on
average, which sums up to 7.5 * 55 = 412.5 ms.

The threaded bringup after CPUHP_AP_ONLINE takes about 100us per CPU.

identify_secondary_cpu() is one of the longer functions. It takes
~125us / CPU, summing up to 13ms.

The TSC sync check for the first CPU on the second socket consumes
20ms. That's only once per socket; intra socket is using MSR_TSC_ADJUST,
which is more or less free.

So the 516 ms are wasted here:

   total                      516 ms
   ucode_cpu_init()           412 ms
   identify_secondary_cpu()    13 ms
   2ndsocket_tsc_sync          20 ms
   threaded bringup            12 ms
   rest                        59 ms

So the rest is about 530us per CPU, which is just the sum of many small
functions, lock contentions...

Getting rid of the microcode overhead is possible. There is no reason
to serialize that between the cores. But it needs serialization vs. HT
siblings, which requires moving identify_secondary_cpu() and its caller
smp_store_cpu_info() ahead of the synchronization point and then having
serialization between the siblings.
That's going to be a major surgery and inspection effort to ensure that
there are no hidden assumptions about global hotplug serialization.

So that would cut the total cost down to ~100ms plus the
preparatory/SIPI stage of 60ms, which sums up to about 160ms and about
1.5ms per CPU total.

Further optimization starts to be questionable IMO. It's surely possible
somehow, but then you really have to go and inspect each and every
function in those code paths, add local locking, etc. Not to talk about
the required mess in the core code to support that.

The low hanging fruit which brings most is the identification/topology
muck and the microcode loading. That needs to be addressed first anyway.

Thanks,

        tglx
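
For anyone who wants to re-check the numbers, the breakdown and the ~100ms projection above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic, not a measurement tool. It assumes 111 APs (112 CPUs minus the boot CPU) and that the whole microcode cost drops out of the serialized path once the load runs in parallel across cores. The variable names are illustrative and do not come from the kernel.

```python
# Figures quoted in the mail, in milliseconds.
TOTAL_MS = 516
breakdown_ms = {
    "ucode_cpu_init()": 412,
    "identify_secondary_cpu()": 13,
    "2ndsocket_tsc_sync": 20,
    "threaded bringup": 12,
}

# Assumption: 112 CPUs minus the boot CPU leaves 111 APs to bring up.
NUM_APS = 112 - 1

# Whatever the named items do not account for is the "rest".
rest_ms = TOTAL_MS - sum(breakdown_ms.values())
rest_us_per_ap = rest_ms * 1000 / NUM_APS

# Microcode estimate from the mail: ~7.5 ms for each of 55 primary threads.
ucode_estimate_ms = 7.5 * 55

# Assumption: with the microcode load parallelised across cores, the
# full ucode_cpu_init() share disappears from the serialized bringup.
remaining_ms = TOTAL_MS - breakdown_ms["ucode_cpu_init()"]

print(f"rest: {rest_ms} ms, {rest_us_per_ap:.0f} us per AP")
print(f"microcode estimate: {ucode_estimate_ms} ms")
print(f"remaining after parallel microcode load: {remaining_ms} ms")
```

The script gives a rest of 59 ms (roughly 530us per AP) and about 104 ms remaining, which matches the ~100ms figure used for the projection.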