On Thu, Sep 22, 2016 at 9:15 AM, Matt Redfearn <matt.redfearn@xxxxxxxxxx> wrote: > Hi Justin, > > Nice catch! Funnily enough I ran into the same deadlock today so your > debugging helped arrive at a quick solution. The following patch fixes the > issue for me, does it fix it for you? Awesome! Yup, this fixes it for me. I am using a 4.1 kernel, so maybe we can send the patch to v4.1+? or maybe earlier? Thanks Matt! > > > MIPS: smp: Fix possibility of deadlock when bringing CPUs online > > This patch fixes the possibility of a deadlock when bringing up > secondary CPUs. > The deadlock occurs because the set_cpu_online() is called before > synchronise_count_slave(). This can cause a deadlock if the boot CPU, > having scheduled another thread, attempts to send an IPI to the > secondary CPU, which it sees has been marked online. The secondary is > blocked in synchronise_count_slave() waiting for the boot CPU to enter > synchronise_count_master(), but the boot cpu is blocked in > smp_call_function_many() waiting for the secondary to respond to it's > IPI request. > > Fix this by marking the CPU online in cpu_callin_map and synchronising > counters before declaring the CPU online and calculating the maps for > IPIs. > > Cc: stable@xxxxxxxxxxxxxxx # v4.2+ > Reported-by: Justin Chen <justinpopo6@xxxxxxxxx> > Signed-off-by: Matt Redfearn <matt.redfearn@xxxxxxxxxx> > > diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c > index f95f094f36e4..b0baf48951fa 100644 > --- a/arch/mips/kernel/smp.c > +++ b/arch/mips/kernel/smp.c > @@ -322,6 +322,9 @@ asmlinkage void start_secondary(void) > cpumask_set_cpu(cpu, &cpu_coherent_mask); > notify_cpu_starting(cpu); > > + cpumask_set_cpu(cpu, &cpu_callin_map); > + synchronise_count_slave(cpu); > + > set_cpu_online(cpu, true); > > set_cpu_sibling_map(cpu); > @@ -329,10 +332,6 @@ asmlinkage void start_secondary(void) > > calculate_cpu_foreign_map(); > > - cpumask_set_cpu(cpu, &cpu_callin_map); > - > - synchronise_count_slave(cpu); > - > /* > * irq will be enabled in ->smp_finish(), enabling it too early > * is dangerous. > ~ > > If so, I will send the patch to Ralf. > > Thanks, > > Matt > > > > > On 21/09/16 22:32, Justin Chen wrote: >> >> Hello everyone, >> >> I am running into a deadlock while testing bmips power management >> code. Currently attempting to add power management functionality to >> arch/mips/configs/bmips_stb_defconfig. The kernel locks up when coming >> back from a suspend state. I am working on a bcm7435 board. >> >> In arch/mips/kernel/smp.c:start_secondary >> --- >> asmlinkage void start_secondary(void) >> { >> .... >> set_cpu_online(cpu, true); >> >> set_cpu_sibling_map(cpu); >> set_cpu_core_map(cpu); >> >> calculate_cpu_foreign_map(); >> >> cpumask_set_cpu(cpu, &cpu_callin_map); >> >> synchronise_count_slave(cpu); >> .... >> } >> --- >> The deadlock occurs because the set_cpu_online() is called before >> synchronise_count_slave(). This can cause a deadlock if the boot cpu >> sees that the secondary cpu is online and tries to execute a function >> on it before it synchronizes with it. The boot cpu ends up waiting for >> the secondary cpu to execute a function, while the secondary cpu waits >> for the boot cpu to synchronise with it. >> >> Lets assume the following occurs. >> >> 1. CPU0 starts CPU1. CPU0 starts waiting for CPU1 to start up. >> CPU0 ends up at arch/mips/kernel/smp.c:__cpu_up(). >> --- >> int __cpu_up(unsigned int cpu, struct task_struct *tidle) >> { >> mp_ops->boot_secondary(cpu, tidle); >> >> /* >> * Trust is futile. We should really have timeouts ... >> */ >> while (!cpumask_test_cpu(cpu, &cpu_callin_map)) { >> udelay(100); >> schedule(); >> } >> >> synchronise_count_master(cpu); >> return 0; >> } >> --- >> While CPU0 waits for CPU1 it schedules another thread. >> >> 2. CPU0 begins executing a new thread and eventually ends up at >> kernel/smp.c:smp_call_function_many() >> --- >> void smp_call_function_many(const struct cpumask *mask, >> smp_call_func_t func, void *info, bool wait) >> { >> .... >> /* No online cpus? We're done. */ >> if (cpu >= nr_cpu_ids) >> return; >> >> /* Do we have another CPU which isn't us? */ >> next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask); >> if (next_cpu == this_cpu) >> next_cpu = cpumask_next_and(next_cpu, mask, >> cpu_online_mask); >> >> /* Fastpath: do that cpu by itself. */ >> if (next_cpu >= nr_cpu_ids) { >> smp_call_function_single(cpu, func, info, wait); >> return; >> } >> .... >> } >> --- >> 3. CPU1 executes set_cpu_online() and blocks at >> synchronise_count_slave(). Thus CPU1 is blocked, however it tells >> everyone it is online. >> >> 4. CPU0(in kernel/smp.c:smp_call_function_many()) sees that one CPU is >> online and attempts to a run a function on that CPU(which is CPU1). >> CPU0 then blocks with no preempt and irqs off. Thus both CPUs are >> deadlocked. >> CPU0 is blocked at smp_call_function_single() >> CPU1 is blocked at synchronise_count_slave() >> >> I am running into this issue with this execution. >> kernel/power/standby.c: suspend_enter() >> kernel/power/standby.c: syscore_resume() (Coming out of suspend, only 1 >> cpu up) >> kernel/time/timekeeping.c: timekeeping_resume() (syscore calls the >> timekeeping resume hook) >> kernel/time/hrtimer.c: hrtimer_resume() >> kernel/time/hrtimer.c: clock_was_set_delayed() (We schedule some work for >> later) >> ... >> CPU0 then starts up CPU1 >> arch/mips/kernel/smp.c: __cpu_up() (Comes here and decides to schedule >> the hrtimer thread) >> kernel/time/hrtimer.c: clock_was_set_work() >> kernel/time/hrtimer.c: clock_was_set() >> kernel/time/hrtimer.c: on_each_cpu() (and this is where we get screwed) >> ... >> kernel/smp.c: smp_call_function_many() (Eventually gets here and >> blocks if CPU1 already executed set_cpu_online()) >> >> The deadlock doesn't happen when I test pm with no_console_suspend. I >> am assuming this happens because things get printed out before >> "set_cpu_online()" gets executed. Thus delaying the timing. Then CPU0 >> does not see that CPU1 is online when running >> "smp_call_function_many()". >> >> Am I seeing this correctly? What would be the proper fix to this? >> >> Thanks, >> Justin >> >