On Mon, 2013-11-25 at 02:48 -0500, George Spelvin wrote:
> I first reported this in mid-October, but I've been AFK for a month
> and haven't done anything about it in that time.  Basically, sustained
> linear reads from 6 (7200 RPM 2 TB) disks on a BR10i controller cause
> a hard lockup.
>
> Anyway, I recompiled with CONFIG_LOCKUP_DETECTOR, and it didn't take
> long to capture this (hand-transcribed, but double-checked).  I omitted
> most of the timestamps, as they're not very interesting, but I included
> a few at the end that had significant delays between them.
>
> Does anyone have any ideas for where to start debugging this?

The reason for the lack of replies is that no-one has much of an idea.
This really looks like a hardware problem.  The qi_submit_sync() is
suggestive: it's the Intel IOMMU mapping call ... have you tried
reproducing this with the IOMMU disabled?

James

> Thank you very much!
>
> [ 321.243221] ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 0 at kernel/watchdog.c:245 watchdog_overflow_callback+0x9a/0xc0()
> Watchdog detected hard LOCKUP on cpu 0
> Modules linked in: twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common ecb cmac xcbc fuse
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.1-00045-g27b879d64d #306
> Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./X79-UP4, BIOS F2 07/16/2012
>  0000000000000009 ffff88043fc06c40 ffffffff815d0ee9 ffff88043fc06c88
>  ffff88043fc06c78 ffffffff8104fef3 ffff88042d816800 0000000000000000
>  ffff88043fc06da0 0000000000000000 ffff88043fc06ef8 ffff88043fc06cd8
> Call Trace:
>  <NMI>  [<ffffffff815d0ee9>] dump_stack+0x54/0x74
>  [<ffffffff8104fef3>] warn_slowpath_common+0x73/0x90
>  [<ffffffff8104ff57>] warn_slowpath_fmt+0x47/0x50
>  [<ffffffff810bc990>] ? restart_watchdog_hrtimer+0x40/0x40
>  [<ffffffff810bca2a>] watchdog_overflow_callback+0x9a/0xc0
>  [<ffffffff810c924e>] __perf_event_overflow+0x8e/0x2c0
>  [<ffffffff810c9c44>] perf_event_overflow+0x14/0x20
>  [<ffffffff8101be36>] intel_pmu_handle_irq+0x1b6/0x390
>  [<ffffffff810150cb>] perf_event_nmi_handler+0x2b/0x50
>  [<ffffffff81006857>] nmi_handle.isra.3+0x87/0x140
>  [<ffffffff810069e0>] do_nmi+0xd0/0x340
>  [<ffffffff815d9ab7>] end_repeat_nmi+0x1e/0x2e
>  [<ffffffff815d9161>] ? _raw_spin_lock+0x11/0x40
>  [<ffffffff815d9161>] ? _raw_spin_lock+0x11/0x40
>  [<ffffffff815d9161>] ? _raw_spin_lock+0x11/0x40
>  <<EOE>>  <IRQ>  [<ffffffff814dbc2a>] ? qi_submit_sync+0x28a/0x450
>  [<ffffffff813b1e1d>] ? scsi_run_queue+0x11d/0x280
>  [<ffffffff814dbeca>] qi_flush_iotlb+0x5a/0x60
>  [<ffffffff814dce9a>] flush_unmaps+0x15a/0x170
>  [<ffffffff814dceb0>] ? flush_unmaps+0x170/0x170
>  [<ffffffff814dcec9>] flush_unmaps_timeout+0x19/0x30
>  [<ffffffff8105a7c2>] call_timer_fn.isra.29+0x22/0x80
>  [<ffffffff8105a9d9>] run_timer_softirq+0x1b9/0x290
>  [<ffffffff8120cc00>] ? timerqueue_add+0x60/0xb0
>  [<ffffffff810546c9>] __do_softirq+0xd9/0x1a0
>  [<ffffffff815daf7c>] call_softirq+0x1c/0x30
>  [<ffffffff81004d75>] do_softirq+0x35/0x70
>  [<ffffffff810548e5>] irq_exit+0x95/0xa0
>  [<ffffffff8102c08f>] smp_apic_timer_interrupt+0x3f/0x50
>  [<ffffffff815da90a>] apic_timer_interrupt+0x6a/0x70
>  <EOI>  [<ffffffff81070b52>] ? __hrtimer_start_range_ns+0x1f2/0x3b0
>  [<ffffffff814ca1c7>] ? cpuidle_enter_state+0x47/0xc0
>  [<ffffffff814ca1c3>] ? cpuidle_enter_state+0x43/0xc0
>  [<ffffffff814ca2e9>] cpuidle_idle_call+0xa9/0x150
>  [<ffffffff8100bed9>] arch_cpu_idle+0x9/0x20
>  [<ffffffff8109619e>] cpu_startup_entry+0x7e/0x170
>  [<ffffffff815c97eb>] rest_init+0x8b/0x90
>  [<ffffffff81ab5d35>] start_kernel+0x2d9/0x2e4
>  [<ffffffff81ab5865>] ? repair_env_string+0x5c/0x5c
>  [<ffffffff81ab55a3>] x86_64_start_reservations+0x2a/0x2c
>  [<ffffffff81ab566c>] x86_64_start_kernel+0xc7/0xca
> [ 321.271385] ---[ end trace e25797a0833ba41e ]---
> [ 321.272175] perf samples too long (226338 > 2500), lowering kernel.perf_event_max_sample_rate to 50100
> [ 321.272986] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 29.766 msecs
> [ 329.848706] perf samples too long (224588 > 4990), lowering kernel.perf_event_max_sample_rate to 25200
> [ 338.553847] perf samples too long (222847 > 9920), lowering kernel.perf_event_max_sample_rate to 12600
> [ 339.993145] mptscsih: ioc0: attempting task abort! (sc=ffff880422009d00)
> [ 339.993331] sd 14:0:3:0: [sdf] CDB:
> [ 339.993603] Read(10): 28 00 01 fa 8d 00 00 04 00 00
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
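
On the suggestion to retest with the IOMMU disabled: one way to do that is
to boot with intel_iommu=off (or iommu=off on x86-64) on the kernel command
line. The sketch below is only an illustration of how one might confirm,
after rebooting, that such a parameter actually reached the kernel. The
function names are hypothetical, and it only detects an explicit "off"
setting; the absence of these parameters does not by itself prove the IOMMU
is in use.

```python
from pathlib import Path

# Boot parameters that explicitly turn the IOMMU off.
# intel_iommu=off targets the Intel (DMAR) driver; iommu=off is the
# generic x86-64 switch.
IOMMU_OFF_PARAMS = {"intel_iommu=off", "iommu=off"}


def iommu_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if the command line explicitly disables the IOMMU."""
    return any(param in IOMMU_OFF_PARAMS for param in cmdline.split())


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text()
```

On the test machine, iommu_disabled_on_cmdline(read_running_cmdline())
should return True before the sustained-read workload is started again;
if the lockup still reproduces in that state, the qi_submit_sync() path
is unlikely to be the cause.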