[RFC] Timing hazard in arch/mips/kernel/smp.c:start_secondary

Justin Chen <justinpopo6@xxxxxxxxx> · Wed, 21 Sep 2016 14:32:12 -0700

Hello everyone,

I am running into a deadlock while testing bmips power management
code. I am currently trying to add power management functionality to
arch/mips/configs/bmips_stb_defconfig, and the kernel locks up when
resuming from suspend. I am working on a bcm7435 board.

In arch/mips/kernel/smp.c:start_secondary
---
asmlinkage void start_secondary(void)
{
        ....
        set_cpu_online(cpu, true);

        set_cpu_sibling_map(cpu);
        set_cpu_core_map(cpu);

        calculate_cpu_foreign_map();

        cpumask_set_cpu(cpu, &cpu_callin_map);

        synchronise_count_slave(cpu);
        ....
}
---
The deadlock occurs because set_cpu_online() is called before
synchronise_count_slave(). If the boot cpu sees that the secondary cpu
is online and tries to execute a function on it before the two have
synchronised, the boot cpu ends up waiting for the secondary cpu to
execute the function, while the secondary cpu waits for the boot cpu
to synchronise with it.

Let's assume the following occurs:

1. CPU0 starts CPU1. CPU0 starts waiting for CPU1 to start up.
CPU0 ends up at arch/mips/kernel/smp.c:__cpu_up().
---
int __cpu_up(unsigned int cpu, struct task_struct *tidle)
{
        mp_ops->boot_secondary(cpu, tidle);

        /*
         * Trust is futile.  We should really have timeouts ...
         */
        while (!cpumask_test_cpu(cpu, &cpu_callin_map)) {
                udelay(100);
                schedule();
        }

        synchronise_count_master(cpu);
        return 0;
}
---
While CPU0 waits for CPU1 it schedules another thread.

2. CPU0 begins executing a new thread and eventually ends up at
kernel/smp.c:smp_call_function_many()
---
void smp_call_function_many(const struct cpumask *mask,
                            smp_call_func_t func, void *info, bool wait)
{
        ....
        /* No online cpus?  We're done. */
        if (cpu >= nr_cpu_ids)
                return;

        /* Do we have another CPU which isn't us? */
        next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
        if (next_cpu == this_cpu)
                next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);

        /* Fastpath: do that cpu by itself. */
        if (next_cpu >= nr_cpu_ids) {
                smp_call_function_single(cpu, func, info, wait);
                return;
        }
        ....
}
---
3. CPU1 executes set_cpu_online() and blocks at
synchronise_count_slave(). CPU1 is now blocked, yet it has already
told everyone that it is online.

4. CPU0 (in kernel/smp.c:smp_call_function_many()) sees that one other
CPU is online and attempts to run a function on that CPU (which is
CPU1). CPU0 then blocks with preemption disabled and irqs off. Thus
both CPUs are deadlocked:
CPU0 is blocked at smp_call_function_single()
CPU1 is blocked at synchronise_count_slave()
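
The end state can be written down as a wait-for relation between the
two CPUs. The sketch below is a user-space illustration only: the
names NCPUS, NOT_WAITING and in_wait_cycle are hypothetical, and the
array simply records which CPU each CPU is blocked on.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 2
#define NOT_WAITING (-1)

/*
 * waits_for[cpu] holds the CPU that `cpu` is blocked on, or NOT_WAITING.
 * Entries are assumed to be NOT_WAITING or a valid CPU index.
 * Follow the edges from `start`; coming back to `start` means deadlock.
 */
static bool in_wait_cycle(const int waits_for[NCPUS], int start)
{
        int cur;

        assert(start >= 0 && start < NCPUS);

        cur = waits_for[start];
        for (int hops = 0; hops < NCPUS && cur != NOT_WAITING; hops++) {
                if (cur == start)
                        return true;
                cur = waits_for[cur];
        }
        return false;
}
```

With waits_for = { 1, 0 } (CPU0 waiting on CPU1 to run the function,
CPU1 waiting on CPU0 to synchronise) the check reports a cycle for
either CPU. With { NOT_WAITING, 0 }, which is the expected case where
CPU0 is free to reach synchronise_count_master(), it does not.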

I am running into this issue with the following call sequence:
kernel/power/suspend.c: suspend_enter()
kernel/power/suspend.c: syscore_resume() (Coming out of suspend, only 1 cpu up)
kernel/time/timekeeping.c: timekeeping_resume() (syscore calls the
timekeeping resume hook)
kernel/time/hrtimer.c: hrtimer_resume()
kernel/time/hrtimer.c: clock_was_set_delayed() (We schedule some work for later)
...
CPU0 then starts up CPU1
arch/mips/kernel/smp.c: __cpu_up() (Comes here and decides to schedule
the hrtimer thread)
kernel/time/hrtimer.c: clock_was_set_work()
kernel/time/hrtimer.c: clock_was_set()
kernel/time/hrtimer.c: on_each_cpu() (and this is where we get screwed)
...
kernel/smp.c: smp_call_function_many() (Eventually gets here and
blocks if CPU1 already executed set_cpu_online())

The deadlock doesn't happen when I test pm with no_console_suspend. I
assume this is because messages get printed before set_cpu_online()
is executed, which changes the timing so that CPU0 does not yet see
CPU1 as online when it runs smp_call_function_many().

Am I seeing this correctly? What would be the proper fix for this?

Thanks,
Justin