Re: [RFC] Timing hazard in arch/mips/kernel/smp.c:start_secondary

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Sep 22, 2016 at 9:15 AM, Matt Redfearn <matt.redfearn@xxxxxxxxxx> wrote:
> Hi Justin,
>
> Nice catch! Funnily enough I ran into the same deadlock today so your
> debugging helped arrive at a quick solution. The following patch fixes the
> issue for me, does it fix it for you?
Awesome! Yup, this fixes it for me. I am using a 4.1 kernel, so maybe
we can send the patch to v4.1+? or maybe earlier?

Thanks Matt!
>
>
>     MIPS: smp: Fix possibility of deadlock when bringing CPUs online
>
>     This patch fixes the possibility of a deadlock when bringing up
>     secondary CPUs.
>     The deadlock occurs because the set_cpu_online() is called before
>     synchronise_count_slave(). This can cause a deadlock if the boot CPU,
>     having scheduled another thread, attempts to send an IPI to the
>     secondary CPU, which it sees has been marked online. The secondary is
>     blocked in synchronise_count_slave() waiting for the boot CPU to enter
>     synchronise_count_master(), but the boot cpu is blocked in
>     smp_call_function_many() waiting for the secondary to respond to it's
>     IPI request.
>
>     Fix this by marking the CPU online in cpu_callin_map and synchronising
>     counters before declaring the CPU online and calculating the maps for
>     IPIs.
>
>     Cc: stable@xxxxxxxxxxxxxxx # v4.2+
>     Reported-by: Justin Chen <justinpopo6@xxxxxxxxx>
>     Signed-off-by: Matt Redfearn <matt.redfearn@xxxxxxxxxx>
>
> diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
> index f95f094f36e4..b0baf48951fa 100644
> --- a/arch/mips/kernel/smp.c
> +++ b/arch/mips/kernel/smp.c
> @@ -322,6 +322,9 @@ asmlinkage void start_secondary(void)
>         cpumask_set_cpu(cpu, &cpu_coherent_mask);
>         notify_cpu_starting(cpu);
>
> +       cpumask_set_cpu(cpu, &cpu_callin_map);
> +       synchronise_count_slave(cpu);
> +
>         set_cpu_online(cpu, true);
>
>         set_cpu_sibling_map(cpu);
> @@ -329,10 +332,6 @@ asmlinkage void start_secondary(void)
>
>         calculate_cpu_foreign_map();
>
> -       cpumask_set_cpu(cpu, &cpu_callin_map);
> -
> -       synchronise_count_slave(cpu);
> -
>         /*
>          * irq will be enabled in ->smp_finish(), enabling it too early
>          * is dangerous.
> ~
>
> If so, I will send the patch to Ralf.
>
> Thanks,
>
> Matt
>
>
>
>
> On 21/09/16 22:32, Justin Chen wrote:
>>
>> Hello everyone,
>>
>> I am running into a deadlock while testing bmips power management
>> code. Currently attempting to add power management functionality to
>> arch/mips/configs/bmips_stb_defconfig. The kernel locks up when coming
>> back from a suspend state. I am working on a bcm7435 board.
>>
>> In arch/mips/kernel/smp.c:start_secondary
>> ---
>> asmlinkage void start_secondary(void)
>> {
>>          ....
>>          set_cpu_online(cpu, true);
>>
>>          set_cpu_sibling_map(cpu);
>>          set_cpu_core_map(cpu);
>>
>>          calculate_cpu_foreign_map();
>>
>>          cpumask_set_cpu(cpu, &cpu_callin_map);
>>
>>          synchronise_count_slave(cpu);
>>          ....
>> }
>> ---
>> The deadlock occurs because the set_cpu_online() is called before
>> synchronise_count_slave(). This can cause a deadlock if the boot cpu
>> sees that the secondary cpu is online and tries to execute a function
>> on it before it synchronizes with it. The boot cpu ends up waiting for
>> the secondary cpu to execute a function, while the secondary cpu waits
>> for the boot cpu to synchronise with it.
>>
>> Lets assume the following occurs.
>>
>> 1. CPU0 starts CPU1. CPU0 starts waiting for CPU1 to start up.
>> CPU0 ends up at arch/mips/kernel/smp.c:__cpu_up().
>> ---
>> int __cpu_up(unsigned int cpu, struct task_struct *tidle)
>> {
>>          mp_ops->boot_secondary(cpu, tidle);
>>
>>          /*
>>           * Trust is futile.  We should really have timeouts ...
>>           */
>>          while (!cpumask_test_cpu(cpu, &cpu_callin_map)) {
>>                  udelay(100);
>>                  schedule();
>>          }
>>
>>          synchronise_count_master(cpu);
>>          return 0;
>> }
>> ---
>> While CPU0 waits for CPU1 it schedules another thread.
>>
>> 2. CPU0 begins executing a new thread and eventually ends up at
>> kernel/smp.c:smp_call_function_many()
>> ---
>> void smp_call_function_many(const struct cpumask *mask,
>>                              smp_call_func_t func, void *info, bool wait)
>> {
>>          ....
>>          /* No online cpus?  We're done. */
>>          if (cpu >= nr_cpu_ids)
>>                  return;
>>
>>          /* Do we have another CPU which isn't us? */
>>          next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
>>          if (next_cpu == this_cpu)
>>                  next_cpu = cpumask_next_and(next_cpu, mask,
>> cpu_online_mask);
>>
>>          /* Fastpath: do that cpu by itself. */
>>          if (next_cpu >= nr_cpu_ids) {
>>                  smp_call_function_single(cpu, func, info, wait);
>>                  return;
>>          }
>>          ....
>> }
>> ---
>> 3. CPU1 executes set_cpu_online() and blocks at
>> synchronise_count_slave(). Thus CPU1 is blocked, however it tells
>> everyone it is online.
>>
>> 4. CPU0(in kernel/smp.c:smp_call_function_many()) sees that one CPU is
>> online and attempts to a run a function on that CPU(which is CPU1).
>> CPU0 then blocks with no preempt and irqs off. Thus both CPUs are
>> deadlocked.
>> CPU0 is blocked at smp_call_function_single()
>> CPU1 is blocked at synchronise_count_slave()
>>
>> I am running into this issue with this execution.
>> kernel/power/standby.c: suspend_enter()
>> kernel/power/standby.c: syscore_resume() (Coming out of suspend, only 1
>> cpu up)
>> kernel/time/timekeeping.c: timekeeping_resume() (syscore calls the
>> timekeeping resume hook)
>> kernel/time/hrtimer.c: hrtimer_resume()
>> kernel/time/hrtimer.c: clock_was_set_delayed() (We schedule some work for
>> later)
>> ...
>> CPU0 then starts up CPU1
>> arch/mips/kernel/smp.c: __cpu_up() (Comes here and decides to schedule
>> the hrtimer thread)
>> kernel/time/hrtimer.c: clock_was_set_work()
>> kernel/time/hrtimer.c: clock_was_set()
>> kernel/time/hrtimer.c: on_each_cpu() (and this is where we get screwed)
>> ...
>> kernel/smp.c: smp_call_function_many() (Eventually gets here and
>> blocks if CPU1 already executed set_cpu_online())
>>
>> The deadlock doesn't happen when I test pm with no_console_suspend. I
>> am assuming this happens because things get printed out before
>> "set_cpu_online()" gets executed. Thus delaying the timing. Then CPU0
>> does not see that CPU1 is online when running
>> "smp_call_function_many()".
>>
>> Am I seeing this correctly? What would be the proper fix to this?
>>
>> Thanks,
>> Justin
>>
>




[Index of Archives]     [Linux MIPS Home]     [LKML Archive]     [Linux ARM Kernel]     [Linux ARM]     [Linux]     [Git]     [Yosemite News]     [Linux SCSI]     [Linux Hams]

  Powered by Linux