Hi,

I'm having problems when using smartmontools with SATA disks behind an LSI SAS controller. The machine is a Dell PowerEdge 1950-II. The controller in question:

02:08.0 SCSI storage controller [0100]: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS [1000:0054] (rev 01)
        Subsystem: Dell SAS 5/i Integrated Controller [1028:1f06]
        Flags: bus master, 66MHz, medium devsel, latency 72, IRQ 1270
        I/O ports at ec00 [disabled] [size=256]
        Memory at fc8fc000 (64-bit, non-prefetchable) [size=16K]
        Memory at fc8e0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at fc900000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [68] PCI-X non-bridge device
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Kernel driver in use: mptsas
        Kernel modules: mptsas

History:

- The machine was running with kernel 2.6.22 and smartmontools 5.37 & 5.38 (from Debian) for a long time. smartd occasionally complained about "Device: /dev/sdX, not capable of SMART self-check", but other than that the machine was stable. smartd configuration:

  /dev/sda -d sat -a -s (L/../../4/03|S/../.././02|O/../../6/03) -m root -I 190 -I 194
  /dev/sdb -d sat -a -s (L/../../4/03|S/../.././02|O/../../6/03) -m root -I 190 -I 194

  sda is a Samsung HD160JJ, sdb is a Seagate ST3160812AS (oh well).

- After switching to 2.6.26 (from Debian Lenny), running smartd started to cause the disks to go offline within a couple of hours after boot. Log sample:

Sep 7 08:50:36 gw kernel: [4917120.304690] mptscsih: ioc0: attempting task abort!
(sc=ffff81007ff26940)
Sep 7 08:50:36 gw kernel: [4917120.304690] sd 0:0:1:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Sep 7 08:50:40 gw kernel: [4917126.213130] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 7 08:50:40 gw kernel: [4917126.215970] mptsas: ioc0: removing sata device, channel 0, id 1, phy 1
Sep 7 08:50:40 gw kernel: [4917126.215974] port-0:1: mptsas: ioc0: delete port (1)
Sep 7 08:50:40 gw kernel: [4917126.216570] sd 0:0:1:0: [sdb] Synchronizing SCSI cache
Sep 7 08:50:40 gw kernel: [4917126.563597] mptscsih: ioc0: task abort: SUCCESS (sc=ffff81007ff26940)
Sep 7 08:50:40 gw kernel: [4917126.563606] mptscsih: ioc0: attempting task abort! (sc=ffff81007ff26bc0)
Sep 7 08:50:40 gw kernel: [4917126.563609] sd 0:0:1:0: [sdb] CDB: Write(10): 2a 00 01 49 f2 98 00 00 08 00
Sep 7 08:50:40 gw kernel: [4917126.563617] mptscsih: ioc0: task abort: SUCCESS (sc=ffff81007ff26bc0)
Sep 7 08:50:40 gw kernel: [4917126.563623] mptscsih: ioc0: attempting target reset! (sc=ffff81007ff26940)
Sep 7 08:50:40 gw kernel: [4917126.563625] sd 0:0:1:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Sep 7 08:50:40 gw kernel: [4917126.897143] mptscsih: ioc0: target reset: SUCCESS (sc=ffff81007ff26940)
Sep 7 08:50:40 gw kernel: [4917126.897143] mptscsih: ioc0: attempting bus reset! (sc=ffff81007ff26940)
Sep 7 08:50:40 gw kernel: [4917126.897143] sd 0:0:1:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Sep 7 08:50:44 gw kernel: [4917131.074580] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff81007ff26940)
Sep 7 08:50:54 gw kernel: [4917145.159523] mptscsih: ioc0: attempting host reset!
(sc=ffff81007ff26940)
Sep 7 08:50:54 gw kernel: [4917145.163513] mptbase: ioc0: Initiating recovery
Sep 7 08:51:10 gw kernel: [4917167.457273] mptscsih: ioc0: host reset: SUCCESS (sc=ffff81007ff26940)
Sep 7 08:51:10 gw kernel: [4917167.457279] sd 0:0:1:0: Device offlined - not ready after error recovery
Sep 7 08:51:10 gw kernel: [4917167.457282] sd 0:0:1:0: Device offlined - not ready after error recovery
Sep 7 08:51:10 gw kernel: [4917167.457350] sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Sep 7 08:51:10 gw kernel: [4917167.457357] end_request: I/O error, dev sdb, sector 21623448
Sep 7 08:51:10 gw kernel: [4917167.457364] raid1: Disk failure on sdb6, disabling device.
Sep 7 08:51:10 gw kernel: [4917167.457365] raid1: Operation continuing on 1 devices.
Sep 7 08:51:10 gw kernel: [4917167.457388] end_request: I/O error, dev sdb, sector 1959743
Sep 7 08:51:10 gw kernel: [4917167.457393] md: super_written gets error=-5, uptodate=0
Sep 7 08:51:10 gw kernel: [4917167.457398] raid1: Disk failure on sdb1, disabling device.
Sep 7 08:51:22 gw kernel: [4917167.457399] raid1: Operation continuing on 1 devices.
Sep 7 08:51:22 gw kernel: [4917167.457411] end_request: I/O error, dev sdb, sector 21478687
Sep 7 08:51:22 gw kernel: [4917167.457415] md: super_written gets error=-5, uptodate=0
Sep 7 08:51:22 gw kernel: [4917167.457420] raid1: Disk failure on sdb5, disabling device.
Sep 7 08:51:22 gw kernel: [4917167.457421] raid1: Operation continuing on 1 devices.
Sep 7 08:51:22 gw kernel: [4917167.461613] sd 0:0:1:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Sep 7 08:51:22 gw kernel: [4917167.526799] raid1: Disk failure on sdb2, disabling device.
Sep 7 08:51:22 gw kernel: [4917167.526801] raid1: Operation continuing on 1 devices.

  After such an error I have to manually remove and re-insert the drive to make the controller detect it again.

- Upgrading to 2.6.30 (from Debian Sid) did not help.

- Upgrading the controller firmware to the latest version available from Dell (the driver reports: FwRev=000a3300h) did not help.

- I've found this thread:

  http://marc.info/?l=smartmontools-support&m=122518510306493&w=2

  It claims that a similar bug was fixed in smartd in CVS HEAD as of 2008-10-30, so I've upgraded to smartmontools 5.38+svn2879-4 from Debian Sid (smartctl -V gives: smartctl 5.39 2009-08-29 r2879), but that did not help either.

Is this a kernel bug (2.6.22 at least did not drop the disks), or a bug in smartmontools?

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
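
P.S. For reference, a small Python sketch that decodes the two CDBs from the log sample above. The field layout is my reading of the ATA PASS-THROUGH(16) format from the SAT specification and of the standard WRITE(10) format. The function names and lookup tables are made up for illustration and cover only a few values, so treat this as a rough aid rather than a complete decoder.

```python
# Rough decoder for the two CDB types seen in the log (illustrative only).

# ATA PASS-THROUGH protocol field values (partial table).
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}

# A few ATA command opcodes that SMART tools commonly issue (partial table).
ATA_COMMANDS = {0xEC: "IDENTIFY DEVICE", 0xB0: "SMART", 0xE5: "CHECK POWER MODE"}


def decode_ata16(cdb_hex):
    """Decode an ATA PASS-THROUGH(16) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F  # bits 4:1 of byte 1
    command = cdb[14]                # ATA command register
    return {
        "protocol": PROTOCOLS.get(protocol, "other (%d)" % protocol),
        "extend": bool(cdb[1] & 0x01),       # 48-bit command if set
        "from_device": bool(cdb[2] & 0x08),  # T_DIR bit
        "sector_count": cdb[6],              # low byte of the count field
        "command": ATA_COMMANDS.get(command, "unknown (0x%02x)" % command),
    }


def decode_write10(cdb_hex):
    """Return (lba, block_count) for a WRITE(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5, big-endian
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8, big-endian
    return lba, blocks


if __name__ == "__main__":
    print(decode_ata16("85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00"))
    print(decode_write10("2a 00 01 49 f2 98 00 00 08 00"))
```

With the values from the log, the pass-through CDB decodes to IDENTIFY DEVICE (0xec) as a PIO Data-In transfer of one sector, and the aborted Write(10) targets LBA 21623448, which is the same sector reported in the first "end_request: I/O error" line.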