On Thu, Aug 03, 2017 at 08:20:22AM +0200, Matija Glavinic Pecotic wrote: > While testing cpu hoptlug (cpu down and up in loops) on kernel 4.4, it was > observed that occasionally check for cpu online will fail in kernel/cpu.c, > _cpu_up: > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/kernel/cpu.c?h=v4.4.79#n485 > 518 /* Arch-specific enabling code. */ > 519 ret = __cpu_up(cpu, idle); > 520 > 521 if (ret != 0) > 522 goto out_notify; > 523 BUG_ON(!cpu_online(cpu)); > > Reason is race between start_secondary and _cpu_up. cpu_callin_map is set > before cpu_online_mask. In __cpu_up, cpu_callin_map is waited for, but cpu > online mask is not, resulting in race in which secondary processor started > and set cpu_callin_map, but not yet set the online mask,resulting in above > BUG being hit. > > Upstream differs in the area. cpu_online check is in bringup_wait_for_ap, > which is after cpu reached AP_ONLINE_IDLE,where secondary passed its start > function. Nonetheless, fix makes start_secondary safe and not depending on > other locks throughout the code. It protects as well against cpu_online > checks put in between sometimes in the future. > > Fix this by moving completion after all flags are set. > > Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@xxxxxxxxx> > --- > arch/mips/kernel/smp.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c > index 770d4d1..6bace76 100644 > --- a/arch/mips/kernel/smp.c > +++ b/arch/mips/kernel/smp.c > @@ -376,9 +376,6 @@ asmlinkage void start_secondary(void) > cpumask_set_cpu(cpu, &cpu_coherent_mask); > notify_cpu_starting(cpu); > > - complete(&cpu_running); > - synchronise_count_slave(cpu); > - > set_cpu_online(cpu, true); > > set_cpu_sibling_map(cpu); > @@ -386,6 +383,9 @@ asmlinkage void start_secondary(void) > > calculate_cpu_foreign_map(); > > + complete(&cpu_running); > + synchronise_count_slave(cpu); > + > /* > * irq will be enabled in ->smp_finish(), enabling it too early > * is dangerous. Makes sense, applied - but after applying I was wondering if synchronise_count_slave() should be before the complete, not after. Ralf