On Mon, 2023-03-20 at 15:30 +0100, Thomas Gleixner wrote: > > This causes a subtle issue. The bringup loop above moves all CPUs to > cpuhp_state == CPUHP_BP_PARALLEL_DYN_END. So the serial bootup will > start from there and bring them fully up. > > Now if a bringup fails, then the rollback will only go back down to > CPUHP_BP_PARALLEL_DYN_END, which means that the control CPU won't do any > cleanups below CPUHP_BP_PARALLEL_DYN_END. > > That 'fail' is a common case for SMT soft disable via the 'nosmt' > command line parameter. Due to the marvelous MCE broadcast 'feature' we > need to bringup the SMT siblings at least to the CPUHP_AP_ONLINE_IDLE > state once and then roll them back. > > While this is not necessarily a fatal problem, it's changing behaviour > and with quite some of the details hidden in the (then not issued) > teardown callbacks might cause some hard to decode subtle surprises. > > So that second for_each_present_cpu() loop needs to check the return > value of cpu_up() and issue a full rollback to CPUHP_OFFLINE in case of > fail. @@ -1524,8 +1531,22 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus) for_each_present_cpu(cpu) { if (num_online_cpus() >= setup_max_cpus) break; - if (!cpu_online(cpu)) - cpu_up(cpu, CPUHP_ONLINE); + if (!cpu_online(cpu)) { + int ret = cpu_up(cpu, CPUHP_ONLINE); + + /* + * For the parallel bringup case, roll all the way back + * to CPUHP_OFFLINE on failure; don't leave them in the + * parallel stages. This happens in the nosmt case for + * non-primary threads. + */ + if (ret && cpuhp_hp_states[CPUHP_BP_PARALLEL_DYN].name) { + struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + if (can_rollback_cpu(st)) + WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, + CPUHP_OFFLINE)); + } + } } }
Attachment:
smime.p7s
Description: S/MIME cryptographic signature