On Fri, Sep 23, 2016 at 3:07 AM, Matt Redfearn <matt.redfearn@xxxxxxxxxx> wrote: > > On 22/09/16 19:04, Justin Chen wrote: >> >> On Thu, Sep 22, 2016 at 9:15 AM, Matt Redfearn <matt.redfearn@xxxxxxxxxx> >> wrote: >>> >>> Hi Justin, >>> >>> Nice catch! Funnily enough I ran into the same deadlock today so your >>> debugging helped arrive at a quick solution. The following patch fixes >>> the >>> issue for me, does it fix it for you? >> >> Awesome! Yup, this fixes it for me. I am using a 4.1 kernel, so maybe >> we can send the patch to v4.1+? or maybe earlier? >> >> Thanks Matt! > > > Hi Justin, > Great! The patch depends on "MIPS: c-r4k: Fix cache flushing for MT cores" > for context, which was only backported as far as 4.1, so I will mark the > patch for 4.1 onwards. Am I OK to add your Tested-by? Alright sounds good. Sure. > > Thanks, > Matt > > > > >>> >>> MIPS: smp: Fix possibility of deadlock when bringing CPUs online >>> >>> This patch fixes the possibility of a deadlock when bringing up >>> secondary CPUs. >>> The deadlock occurs because the set_cpu_online() is called before >>> synchronise_count_slave(). This can cause a deadlock if the boot >>> CPU, >>> having scheduled another thread, attempts to send an IPI to the >>> secondary CPU, which it sees has been marked online. The secondary >>> is >>> blocked in synchronise_count_slave() waiting for the boot CPU to >>> enter >>> synchronise_count_master(), but the boot cpu is blocked in >>> smp_call_function_many() waiting for the secondary to respond to >>> it's >>> IPI request. >>> >>> Fix this by marking the CPU online in cpu_callin_map and >>> synchronising >>> counters before declaring the CPU online and calculating the maps >>> for >>> IPIs. >>> >>> Cc: stable@xxxxxxxxxxxxxxx # v4.2+ >>> Reported-by: Justin Chen <justinpopo6@xxxxxxxxx> >>> Signed-off-by: Matt Redfearn <matt.redfearn@xxxxxxxxxx> >>> >>> diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c >>> index f95f094f36e4..b0baf48951fa 100644 >>> --- a/arch/mips/kernel/smp.c >>> +++ b/arch/mips/kernel/smp.c >>> @@ -322,6 +322,9 @@ asmlinkage void start_secondary(void) >>> cpumask_set_cpu(cpu, &cpu_coherent_mask); >>> notify_cpu_starting(cpu); >>> >>> + cpumask_set_cpu(cpu, &cpu_callin_map); >>> + synchronise_count_slave(cpu); >>> + >>> set_cpu_online(cpu, true); >>> >>> set_cpu_sibling_map(cpu); >>> @@ -329,10 +332,6 @@ asmlinkage void start_secondary(void) >>> >>> calculate_cpu_foreign_map(); >>> >>> - cpumask_set_cpu(cpu, &cpu_callin_map); >>> - >>> - synchronise_count_slave(cpu); >>> - >>> /* >>> * irq will be enabled in ->smp_finish(), enabling it too early >>> * is dangerous. >>> ~ >>> >>> If so, I will send the patch to Ralf. >>> >>> Thanks, >>> >>> Matt >>> >>> >>> >>> >>> On 21/09/16 22:32, Justin Chen wrote: >>>> >>>> Hello everyone, >>>> >>>> I am running into a deadlock while testing bmips power management >>>> code. Currently attempting to add power management functionality to >>>> arch/mips/configs/bmips_stb_defconfig. The kernel locks up when coming >>>> back from a suspend state. I am working on a bcm7435 board. >>>> >>>> In arch/mips/kernel/smp.c:start_secondary >>>> --- >>>> asmlinkage void start_secondary(void) >>>> { >>>> .... >>>> set_cpu_online(cpu, true); >>>> >>>> set_cpu_sibling_map(cpu); >>>> set_cpu_core_map(cpu); >>>> >>>> calculate_cpu_foreign_map(); >>>> >>>> cpumask_set_cpu(cpu, &cpu_callin_map); >>>> >>>> synchronise_count_slave(cpu); >>>> .... >>>> } >>>> --- >>>> The deadlock occurs because the set_cpu_online() is called before >>>> synchronise_count_slave(). This can cause a deadlock if the boot cpu >>>> sees that the secondary cpu is online and tries to execute a function >>>> on it before it synchronizes with it. The boot cpu ends up waiting for >>>> the secondary cpu to execute a function, while the secondary cpu waits >>>> for the boot cpu to synchronise with it. >>>> >>>> Lets assume the following occurs. >>>> >>>> 1. CPU0 starts CPU1. CPU0 starts waiting for CPU1 to start up. >>>> CPU0 ends up at arch/mips/kernel/smp.c:__cpu_up(). >>>> --- >>>> int __cpu_up(unsigned int cpu, struct task_struct *tidle) >>>> { >>>> mp_ops->boot_secondary(cpu, tidle); >>>> >>>> /* >>>> * Trust is futile. We should really have timeouts ... >>>> */ >>>> while (!cpumask_test_cpu(cpu, &cpu_callin_map)) { >>>> udelay(100); >>>> schedule(); >>>> } >>>> >>>> synchronise_count_master(cpu); >>>> return 0; >>>> } >>>> --- >>>> While CPU0 waits for CPU1 it schedules another thread. >>>> >>>> 2. CPU0 begins executing a new thread and eventually ends up at >>>> kernel/smp.c:smp_call_function_many() >>>> --- >>>> void smp_call_function_many(const struct cpumask *mask, >>>> smp_call_func_t func, void *info, bool >>>> wait) >>>> { >>>> .... >>>> /* No online cpus? We're done. */ >>>> if (cpu >= nr_cpu_ids) >>>> return; >>>> >>>> /* Do we have another CPU which isn't us? */ >>>> next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask); >>>> if (next_cpu == this_cpu) >>>> next_cpu = cpumask_next_and(next_cpu, mask, >>>> cpu_online_mask); >>>> >>>> /* Fastpath: do that cpu by itself. */ >>>> if (next_cpu >= nr_cpu_ids) { >>>> smp_call_function_single(cpu, func, info, wait); >>>> return; >>>> } >>>> .... >>>> } >>>> --- >>>> 3. CPU1 executes set_cpu_online() and blocks at >>>> synchronise_count_slave(). Thus CPU1 is blocked, however it tells >>>> everyone it is online. >>>> >>>> 4. CPU0(in kernel/smp.c:smp_call_function_many()) sees that one CPU is >>>> online and attempts to a run a function on that CPU(which is CPU1). >>>> CPU0 then blocks with no preempt and irqs off. Thus both CPUs are >>>> deadlocked. >>>> CPU0 is blocked at smp_call_function_single() >>>> CPU1 is blocked at synchronise_count_slave() >>>> >>>> I am running into this issue with this execution. >>>> kernel/power/standby.c: suspend_enter() >>>> kernel/power/standby.c: syscore_resume() (Coming out of suspend, only 1 >>>> cpu up) >>>> kernel/time/timekeeping.c: timekeeping_resume() (syscore calls the >>>> timekeeping resume hook) >>>> kernel/time/hrtimer.c: hrtimer_resume() >>>> kernel/time/hrtimer.c: clock_was_set_delayed() (We schedule some work >>>> for >>>> later) >>>> ... >>>> CPU0 then starts up CPU1 >>>> arch/mips/kernel/smp.c: __cpu_up() (Comes here and decides to schedule >>>> the hrtimer thread) >>>> kernel/time/hrtimer.c: clock_was_set_work() >>>> kernel/time/hrtimer.c: clock_was_set() >>>> kernel/time/hrtimer.c: on_each_cpu() (and this is where we get screwed) >>>> ... >>>> kernel/smp.c: smp_call_function_many() (Eventually gets here and >>>> blocks if CPU1 already executed set_cpu_online()) >>>> >>>> The deadlock doesn't happen when I test pm with no_console_suspend. I >>>> am assuming this happens because things get printed out before >>>> "set_cpu_online()" gets executed. Thus delaying the timing. Then CPU0 >>>> does not see that CPU1 is online when running >>>> "smp_call_function_many()". >>>> >>>> Am I seeing this correctly? What would be the proper fix to this? >>>> >>>> Thanks, >>>> Justin >>>> >