Hi all,

Debian is using Loongson 3A based machines as build daemons. We
experience stability issues from time to time: the machine freezes
completely, sometimes outputting a backtrace on the serial console:

| ------------[ cut here ]------------
| [158285.176000] WARNING: CPU: 3 PID: 4162 at /build/kernel/linux-3.15.5/kernel/smp.c:338 smp_call_function_many+0x120/0x388()
| [158285.176000] Modules linked in: radeon drm_kms_helper ttm drm dm_mod ehci_pci ata_generic ohci_pci ohci_hcd ehci_hcd usbcore usb_common
| [158285.176000] CPU: 3 PID: 4162 Comm: mysqld Not tainted 3.15-trunk-loongson-3 #1 Debian 3.15.5-1~exp1+rs780e
| [158285.176000] Stack : ffffffff80920000 ffffffff80290fec ffffffff80a00000 ffffffff80291808
|           0000000000000000 0000000000000000 ffffffff809e0000 ffffffff809e0000
|           ffffffff8085f188 ffffffff80914ff7 ffffffff809de068 98000000fab16e58
|           0000000000001042 0000000000000003 0000000000000003 0000000000000001
|           ffffffff8090e688 ffffffff80768cfc 980000014c323c08 ffffffff80234f2c
|           ffffffff8090e688 ffffffff80293180 98000000fab169b0 ffffffff8085f188
|           0000000000000003 0000000000001042 0000000000000000 0000000000000000
|           0000000000000000 980000014c323b50 0000000000000000 ffffffff8076bdb0
|           0000000000000000 0000000000000000 0000000000000000 ffffffff802b2300
|           0000000000000152 ffffffff8020acd0 0000000000000009 ffffffff8076bdb0
|           ...
| [158285.280000] Call Trace:
| [158285.280000] [<ffffffff8020acd0>] show_stack+0x68/0x80
| [158285.280000] [<ffffffff8076bdb0>] dump_stack+0x6c/0x8c
| [158285.280000] [<ffffffff80235088>] warn_slowpath_common+0x88/0xb8
| [158285.280000] [<ffffffff802b2328>] smp_call_function_many+0x120/0x388
| [158285.280000] [<ffffffff802b25bc>] smp_call_function+0x2c/0x40
| [158285.280000] [<ffffffff80223b18>] r4k_flush_data_cache_page+0x38/0x70
| [158285.280000] [<ffffffff803c89b0>] aio_complete+0x170/0x338
| [158285.280000] [<ffffffff803c9bb0>] do_io_submit+0x378/0x768
| [158285.280000] [<ffffffff80218fe8>] handle_sys+0x128/0x14c
| [158285.280000]
| [158285.280000] ---[ end trace 97d7fd09bd30b5b9 ]---

We noticed this happens on various CPUs. The CPU is stuck in this part
of the smp_call_function_many() function:

|        if (wait) {
|                for_each_cpu(cpu, cfd->cpumask) {
|                        struct call_single_data *csd;
|
|                        csd = per_cpu_ptr(cfd->csd, cpu);
|                        csd_lock_wait(csd);
|                }
|        }

and more precisely in the csd_lock_wait() call.
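
To make the symptom concrete, here is a small user-space model of that
wait, not the kernel code itself. It assumes that csd_lock_wait() simply
polls the CSD_FLAG_LOCK bit until the target CPU clears it after running
the callback, which is how we read kernel/smp.c in 3.15. The names
csd_model, csd_model_lock, target_handle_ipi and csd_lock_wait_bounded
are made up for this sketch, and the spin bound exists only so that the
model terminates.

```c
#include <stdbool.h>

/* Flag name borrowed from kernel/smp.c; the value is only part of this model. */
#define CSD_FLAG_LOCK 0x01u

/* Hypothetical stand-in for struct call_single_data: only the flags word. */
struct csd_model {
	volatile unsigned int flags;
};

/* Sender side: mark the request as in flight before the IPI would be sent. */
static void csd_model_lock(struct csd_model *csd)
{
	csd->flags |= CSD_FLAG_LOCK;
}

/* Target side: models the unlock done once the target CPU ran the callback. */
static void target_handle_ipi(struct csd_model *csd)
{
	csd->flags &= ~CSD_FLAG_LOCK;
}

/*
 * Bounded version of the wait. Assumption: the real loop has no bound,
 * so a flag that is never cleared means the sender spins forever.
 * Returns true if the flag was seen clear within max_spins polls.
 */
static bool csd_lock_wait_bounded(const struct csd_model *csd,
				  unsigned long max_spins)
{
	unsigned long i;

	for (i = 0; i < max_spins; i++) {
		if (!(csd->flags & CSD_FLAG_LOCK))
			return true;
	}
	return false;
}
```

In this model, calling csd_model_lock() and then
csd_lock_wait_bounded() without ever calling target_handle_ipi() always
returns false, whatever the bound. That matches what we observe if, for
whatever reason, one of the target CPUs never completes the request. It
does not say why that would happen.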

From time to time (it *seems* to be when the initial issue happens on a
CPU other than #0), we get this kind of additional backtrace a few
seconds later, sometimes repeating regularly on the other CPUs:

| [158313.196000] INFO: rcu_sched self-detected stall on CPU { 2} (t=5250 jiffies g=661863 c=661862 q=4)
| [158313.196000] CPU: 2 PID: 4217 Comm: mysqld Tainted: G W 3.15-trunk-loongson-3 #1 Debian 3.15.5-1~exp1+rs780e
| [158313.196000] Stack : ffffffff80920000 ffffffff80290fec ffffffff80a00000 ffffffff80291808
|           0000000000000000 0000000000000000 ffffffff809e0000 ffffffff809e0000
|           ffffffff8085f188 ffffffff80914ff7 ffffffff809de068 98000000029b73e0
|           0000000000001079 0000000000000002 ffffffff80910000 0000000000000010
|           9800000008d4cbe0 ffffffff80768cfc 98000001305d3858 ffffffff80234e94
|           9800000008d51230 ffffffff80293180 98000000029b6f38 ffffffff8085f188
|           0000000000000002 0000000000001079 0000000000000000 0000000000000000
|           0000000000000000 98000001305d37a0 0000000000000000 ffffffff8076bdb0
|           0000000000000000 0000000000000000 0000000000000000 ffffffff80790000
|           ffffffff80920c40 ffffffff8020acd0 ffffffff80920c40 ffffffff8076bdb0
|           ...
| [158313.196000] Call Trace:
| [158313.196000] [<ffffffff8020acd0>] show_stack+0x68/0x80
| [158313.196000] [<ffffffff8076bdb0>] dump_stack+0x6c/0x8c
| [158313.196000] [<ffffffff8029ff60>] rcu_check_callbacks+0x4d8/0x878
| [158313.196000] [<ffffffff80245418>] update_process_times+0x48/0x88
| [158313.196000] [<ffffffff802ac178>] tick_sched_handle.isra.15+0x20/0x80
| [158313.196000] [<ffffffff802ac218>] tick_sched_timer+0x40/0x70
| [158313.196000] [<ffffffff8025f050>] __run_hrtimer+0xa8/0x240
| [158313.196000] [<ffffffff8025fc08>] hrtimer_interrupt+0x130/0x2f8
| [158313.196000] [<ffffffff8020d754>] c0_compare_interrupt+0x54/0x90
| [158313.196000] [<ffffffff80293eb8>] handle_irq_event_percpu+0x68/0x248
| [158313.196000] [<ffffffff802984fc>] handle_percpu_irq+0x8c/0xc0
| [158313.196000] [<ffffffff802933bc>] generic_handle_irq+0x3c/0x58
| [158313.196000] [<ffffffff80207608>] do_IRQ+0x18/0x30
| [158313.196000] [<ffffffff80205428>] ret_from_irq+0x0/0x4
| [158313.196000] [<ffffffff802b2500>] smp_call_function_many+0x2f8/0x388
| [158313.196000] [<ffffffff802b25bc>] smp_call_function+0x2c/0x40
| [158313.196000] [<ffffffff8020f8a0>] flush_tlb_mm+0x50/0x108
| [158313.196000] [<ffffffff80335a5c>] tlb_finish_mmu+0x74/0x88
| [158313.196000] [<ffffffff8033fef0>] unmap_region+0xc8/0x118
| [158313.196000] [<ffffffff803422c4>] do_munmap+0x264/0x440
| [158313.196000] [<ffffffff803424e4>] vm_munmap+0x44/0x70
| [158313.196000] [<ffffffff8034353c>] SyS_munmap+0x24/0x38
| [158313.196000] [<ffffffff80218fe8>] handle_sys+0x128/0x14c
| [158313.196000]
| [158340.156000] BUG: soft lockup - CPU#2 stuck for 22s! [mysqld:4217]
| [158340.156000] Modules linked in: radeon drm_kms_helper ttm drm dm_mod ehci_pci ata_generic ohci_pci ohci_hcd ehci_hcd usbcore usb_common

Any idea what the problem might be, or how to debug it further?

The problem happens with both Lemote and Loongson machines, and we have
finally found a way to reproduce it every time, by running the MySQL
test suite with 4 threads.
This makes it practical to debug the issue further. If someone is
interested, I think I can package a chroot with all the needed files in
a tarball so that the issue can be reproduced more easily.

Thanks,
Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@xxxxxxxxxxx                 http://www.aurel32.net