On 9/19/2012 1:52 PM, Nix wrote:
> So I have this x86-64 server running Linux 3.5.1

When did you install 3.5.1 on this machine? If fairly recently, does it
run without these errors when booted into the previous kernel?

> with a SATA-on-PCIe
> Areca 1210 hardware RAID-5 controller driven by libata which has been
> humming along happily for years -- but suddenly, today, the entire
> machine froze for a couple of minutes (or at least fs access froze),
> followed by this in the logs:
>
> Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1
> [... repeated a few times at intervals over the next five minutes,
> followed by a mass of them at 16:59:29, and...]
> Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 33
> Sep 19 16:59:25 spindle notice: [3447697.878386] arcmsr0: wait 'abort all outstanding command' timeout
> Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus reset .....
> Sep 19 16:59:25 spindle err: [3447698.287054] irq 16: nobody cared (try booting with the "irqpoll" option)
> Sep 19 16:59:25 spindle warning: [3447698.287291] Pid: 0, comm: swapper/4 Not tainted 3.5.1-dirty #1
> Sep 19 16:59:25 spindle warning: [3447698.287522] Call Trace:
> Sep 19 16:59:25 spindle warning: [3447698.287754] <IRQ> [<ffffffff810af5ba>] __report_bad_irq+0x31/0xc2
> Sep 19 16:59:25 spindle warning: [3447698.288031] [<ffffffff810af84e>] note_interrupt+0x16a/0x1e8
> Sep 19 16:59:25 spindle warning: [3447698.288263] [<ffffffff810ad9d5>] handle_irq_event_percpu+0x163/0x1a5
> Sep 19 16:59:25 spindle warning: [3447698.288497] [<ffffffff810ada4f>] handle_irq_event+0x38/0x55
> Sep 19 16:59:25 spindle warning: [3447698.288727] [<ffffffff810b01a0>] handle_fasteoi_irq+0x78/0xab
> Sep 19 16:59:25 spindle warning: [3447698.288960] [<ffffffff8103631c>] handle_irq+0x24/0x2a
> Sep 19 16:59:25 spindle warning: [3447698.289189] [<ffffffff81036229>] do_IRQ+0x4d/0xb4
> Sep 19 16:59:25 spindle warning: [3447698.289419] [<ffffffff815070e7>] common_interrupt+0x67/0x67
> Sep 19 16:59:25 spindle warning: [3447698.289648] <EOI> [<ffffffff812ab174>] ? acpi_idle_enter_c1+0xcb/0xf2
> Sep 19 16:59:25 spindle warning: [3447698.289919] [<ffffffff812ab152>] ? acpi_idle_enter_c1+0xa9/0xf2
> Sep 19 16:59:25 spindle warning: [3447698.290152] [<ffffffff813c1446>] cpuidle_enter+0x12/0x14
> Sep 19 16:59:25 spindle warning: [3447698.290382] [<ffffffff813c1902>] cpuidle_idle_call+0xc5/0x175
> Sep 19 16:59:25 spindle warning: [3447698.290614] [<ffffffff8103c2da>] cpu_idle+0x5b/0xa5
> Sep 19 16:59:25 spindle warning: [3447698.290844] [<ffffffff81ad4fcb>] start_secondary+0x1a2/0x1a6
> Sep 19 16:59:25 spindle err: [3447698.291074] handlers:
> Sep 19 16:59:25 spindle err: [3447698.291294] [<ffffffff8133b9a3>] usb_hcd_irq
> Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16
> Sep 19 16:59:25 spindle err: [3447710.888187] arcmsr0: waiting for hw bus reset return, retry=0
> Sep 19 16:59:25 spindle err: [3447720.882155] arcmsr0: waiting for hw bus reset return, retry=1
> Sep 19 16:59:25 spindle notice: [3447730.896410] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi bus reset eh returns with success
>
> This is the first SCSI (that is, um, ATA) bus reset I have *ever* had on
> this machine, hence my concern. (The IRQ disable we can ignore: it was
> just bad luck that an interrupt destined for the Areca hit after the
> controller had briefly vanished from the PCI bus as part of resetting.)
>
> Now just last week another (surge-protected) machine on the same power
> main as it died without warning with a fried power supply which
> apparently roasted the BIOS and/or other motherboard components before
> it died (the ACPI DSDT was filled with rubbish, and other things must
> have been fried because even with ACPI off Linux wouldn't boot more than
> one time out of a hundred (freezing solid at different places in the
> boot each time)). So my worry level when this SCSI bus reset turned up
> today is quite high. It's higher given that the controller logs
> (accessed via the Areca binary-only utility for this purpose) show no
> sign of any problem at all.
>
> EDAC shows no PCI bus problems and no memory problems, so this probably
> *is* the controller.
>
> So... is this a serious problem? Does anyone know if I'm about to lose
> this controller, or indeed machine as well? (I really, really hope not.)
>
> I'd write this off as a spurious problem and not report it at all, but
> I'm jittery as heck after the catastrophic hardware failure last week,
> and when this happens in close proximity, I worry.
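
If you do boot the previous kernel for comparison, it may help to count
the relevant messages per boot rather than eyeball the logs. The sketch
below is only an illustration: the regular expressions are derived from
the arcmsr lines quoted above, while the category names and the
summarize() helper are made up for this example and are not part of any
tool.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = {
    "abort": re.compile(r"arcmsr\d*: abort device command"),
    "bus_reset": re.compile(r"arcmsr\d*: executing (hw )?bus reset"),
    "irq_disabled": re.compile(r"Disabling IRQ #\d+"),
}

def summarize(lines):
    """Count how many lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# A few lines copied from the report, used here as sample input.
SAMPLE = [
    "Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1",
    "Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 33",
    "Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus reset .....",
    "Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16",
    "Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi bus reset eh returns with success",
]

for name, count in sorted(summarize(SAMPLE).items()):
    print(name, count)
```

To run it against real logs, pass an open file to summarize(), using
whatever file your syslog daemon writes kernel messages to, once for
each kernel you test.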