Re: ia64 won't boot because of rcu_sched self-detected stall

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 21, 2012 at 03:45:52PM -0700, Tony Luck wrote:
> last few linux-next builds (at least from next-20120806 to
> next-20120821) fail to boot on ia64 with these messages:
> 
>  ...
> Total of 2 processors activated (3891.20 BogoMIPS).
> SMBIOS 2.3 present.
> NET: Registered protocol family 16
> ACPI: bus type pci registered
> bio: create slab <bio-0> at 0
> ACPI: Added _OSI(Module Device)
> ACPI: Added _OSI(Processor Device)
> ACPI: Added _OSI(3.0 _SCP Extensions)
> ACPI: Added _OSI(Processor Aggregator Device)
> INFO: rcu_sched self-detected stall on CPUINFO: rcu_sched detected
> stalls on CPUs/tasks: { 1} (detected by 0, t=15002 jiffies)

Two of them simultaneously, it appears.  Looks like CPU 1 detected itself
about the same time that CPU 0 detected it.  Which is a bit strange,
given that the code gives the CPU a couple of jiffies to complain about
itself before other CPUs are allowed to complain.  Was this a guest OS?
Is the system heavily loaded during boot?

> Call Trace:
>  [<a0000001000154e0>] show_stack+0x80/0xa0
>                                 sp=a000000101247a20 bsp=a0000001012410b0
>  [<a000000100d76df0>] dump_stack+0x30/0x50
>                                 sp=a000000101247bf0 bsp=a000000101241098
>  [<a0000001001311e0>] print_other_cpu_stall+0x440/0x540
>                                 sp=a000000101247bf0 bsp=a000000101241038
>  [<a000000100131860>] rcu_check_callbacks+0x580/0x780
>                                 sp=a000000101247bf0 bsp=a000000101240fd8
>  [<a0000001000967d0>] update_process_times+0x50/0xa0
>                                 sp=a000000101247bf0 bsp=a000000101240fa8
>  [<a00000010003f7f0>] timer_interrupt+0x130/0x3c0
>                                 sp=a000000101247bf0 bsp=a000000101240f58
>  [<a00000010011ee00>] handle_irq_event_percpu+0x80/0x460
>                                 sp=a000000101247bf0 bsp=a000000101240ef0
>  [<a0000001001266c0>] handle_percpu_irq+0x120/0x1a0
>                                 sp=a000000101247bf0 bsp=a000000101240ec0
>  [<a00000010011e0e0>] generic_handle_irq+0xa0/0xe0
>                                 sp=a000000101247bf0 bsp=a000000101240ea0
>  [<a0000001000117b0>] ia64_handle_irq+0x110/0x440
>                                 sp=a000000101247bf0 bsp=a000000101240e18
>  [<a00000010000c140>] ia64_native_leave_kernel+0x0/0x270
>                                 sp=a000000101247bf0 bsp=a000000101240e18
>  [<a000000100013bf0>] default_idle+0x130/0x180
>                                 sp=a000000101247dc0 bsp=a000000101240dd0
>  [<a000000100014ee0>] cpu_idle+0x1c0/0x3c0
>                                 sp=a000000101247e20 bsp=a000000101240d68
>  [<a000000100d379e0>] rest_init+0xe0/0x100
>                                 sp=a000000101247e20 bsp=a000000101240d50
>  [<a0000001010f9360>] start_kernel+0x760/0x870
>                                 sp=a000000101247e20 bsp=a000000101240cd0
>  [<a000000100d378c0>] start_ap+0x760/0x780
>                                 sp=a000000101247e30 bsp=a000000101240bc0

And it looks like the stall was from the idle loop.  Except isn't this
CPU 0's stack, rather than that of the stalled CPU 1?

Having CPU 1's stack would of course be much more helpful.

>  { 1}  (t=15613 jiffies)

This is the tail end of CPU 1's stall-warning message.

> Call Trace:
> 
> ... stack traces keep coming every 64 seconds.

Hmmm... Is the immediately following stack trace that of CPU 1?
That would likely be quite helpful.  If not, please see below.

> Any clues on where I should look to debug this?

CONFIG_RCU_CPU_STALL_INFO=y should give more information on why RCU
thought a stall warning should be printed.  In addition, you might
bump up the value of RCU_STALL_RAT_DELAY (defined in kernel/rcutree.h)
in order to give CPU 0 more time to complain about itself -- hopefully
getting its stack rather than that of an innocent bystander.  It should
be no problem setting RCU_STALL_RAT_DELAY to HZ or perhaps even higher.

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-next" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel]     [Linux USB Development]     [Yosemite News]     [Linux SCSI]

  Powered by Linux