Good morning hope the day is going well for everyone. Apologies for the large broadcast domain on this. I wanted to make sure everyone who may have an interest in this is involved. Some feedback on another issue we encountered with Linux in a production initiator/target environment with SCST. I'm including logs below from three separate systems involved in the incident. I've gone through them with my team and we are currently unsure on what triggered all this, hence mail to everyone who may be involved. The system involved is SCST 1.0.0.0 running on a Linux 2.6.24.7 target platform using the qla_isp driver module. The target machine has two 9650 eight port 3Ware controller cards driving a total of 16 750 gigabyte Seagate NearLine drives. Firmware on the 3ware and Qlogic cards should all be current. There are two identical servers in two geographically separated data-centers. The drives on each platform are broken into four 3+1 RAID5 devices with software RAID. Each RAID5 volume is a physical volume for an LVM volume group. There is currently one logical volume exported from each of four RAID5 volumes as a target device. A total of four initiators are thus accessing the target server, each accessing different RAID5 volumes. The initiators are running a stock 2.6.26.2 kernel with a RHEL5 userspace. Access to the SAN is via a 2462 dual-port Qlogic card. The initiators see a block device from each of the two target servers through separate ports/paths. The block devices form a software RAID1 device (with bitmaps) which is the physical volume for an LVM volume group. The production filesystem is supported by a single logical volume allocated from that volume group. A drive failure occured last Sunday afternoon on one of the RAID5 volumes. The target kernel recognized the failure, failed the device and kept going. Unfortunately three of the four initiators picked up a device failure which caused the SCST exported volume to be faulted out of the RAID1 device. One of the initiators noted an incident was occurring, issued a target reset and continued forward with no issues. The initiator which got things 'right' was not accessing the RAID5 volume on the target which experienced the error. Two of the three initiators which faulted out their volumes were not accessing the compromised RAID5 volume. The initiator accessing the volume faulted out its device. In the logs below the 'init1' initiator was the one which did not fail its device. The init2 log is an example log from the initiators which failed out their devices, behavior seemed to be identical on all the initiators which faulted their block devices. The log labelled target are the log entries from the event on the SCST server. All three servers from which logs were abstracted were NTP time synchronized so log timings are directly correlatable. Some items to note: --- The following log message from the 3Ware driver seems bogus with respect to the port number. Doubtfull this has anything to do with the incident but may be of interest to the 3Ware people copied on this note: Aug 17 17:55:16 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x000A): Drive error detected:unit=2, port=-2147483646. --- The initiators which received I/O errors had the Qlogic driver attempt a 'DEVICE RESET' which failed and was then retried. The second reset attempt succeeded. The 3Ware driver elected to reset the card at 17:55:32. A period of 44 seconds elapses from that message until end_request picks up on the I/O error which causes the RAID5 driver to fault the affected drive. The initiators which failed their 'DEVICE RESET' issued their failed requests during this time window. Of interest to Vlad may be the following log entry(s): Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET FAILED: Task management failed. The initiator which had its 'DEVICE RESET' succeed issued the reset after the above window with a timestamp identical to that of the end_request I/O error message on the target. --- Of interest to NeilB and why I copied him as well is the following. Precisely one minute after the second attempt to reset the target succeeeds the kernel indicates the involved RAID1 kthread has blocked for more than 120 seconds. The call trace indicates the thread was waiting on a RAID superblock update. Immediately after the kernel finishes issueing the message and stack trace the Qlogic driver attempts to abort a SCSI command which results in end_request getting an I/O error which causes the device to be faulted from the RAID1 device. This occurs one full minute AFTER the target RAID5 device has had its device evicted and is continuing in normal but degraded operation. --- Empirically it would seem the initiators which were 'unlucky' happened to issue their 'DEVICE RESET' requests while the SCST service thread they were assigned to was blocked waiting for the 3Ware card to reset. What is unclear is why the initiator I/O error was generated after the reset succeeded the second time, a minute after the incident was completely over as far as the SCST target server was concerned. A question for Vlad. The SCST target server is a dual-processor SMP box with the default value of two kernel threads active. Would it be advantageous to increase this value to avoid situations like this? Would an appropriate metric be to have the number of active threads equal to the number of exported volumes or initiators? I would be interested in any ideas the group may have. Let me know if I can provide additional information or documentation on any of this. Best wishes for a productive remainder of the week. GW Logs follow. init1 (good) -------------------------------------------------------------- Aug 17 17:56:02 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e334 2002. Aug 17 17:56:08 init1 sshd[14485]: Did not receive identification string from 134.129.111.71 Aug 17 17:56:13 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e334 2002. Aug 17 17:56:13 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e335 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e336 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e337 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e338 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e339 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33a 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33b 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33c 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33d 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33e 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 2e33f 2002. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET ISSUED. Aug 17 17:56:16 init1 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET SUCCEEDED. --------------------------------------------------------------------------- init2 (bad) --------------------------------------------------------------- Aug 17 17:55:37 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 3b848 2002. Aug 17 17:55:47 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 3b848 2002. Aug 17 17:55:47 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET ISSUED. Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): DEVICE RESET FAILED: Task management failed. Aug 17 17:56:07 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): TARGET RESET ISSUED. Aug 17 17:56:08 init2 sshd[15769]: Did not receive identification string from 134.129.111.71 Aug 17 17:56:16 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): TARGET RESET SUCCEEDED. Aug 17 17:57:08 init2 sshd[15815]: Did not receive identification string from 134.129.111.71 Aug 17 17:57:16 init2 kernel: INFO: task md7_raid1:739 blocked for more than 120 seconds. Aug 17 17:57:16 init2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 17 17:57:16 init2 kernel: md7_raid1 D ffff81025e405e00 5792 739 2 Aug 17 17:57:16 init2 kernel: ffff81025d523c70 0000000000000046 000000000000ffff 000000003e7fff88 Aug 17 17:57:16 init2 kernel: ffff81025d4cd140 ffff81025fe73000 ffff81025d4cd488 0000000400000000 Aug 17 17:57:16 init2 kernel: ffff8101f81018c0 0000000000000000 ffff81025fdfaa80 ffffffff802a382c Aug 17 17:57:16 init2 kernel: Call Trace: Aug 17 17:57:16 init2 kernel: [<ffffffff802a382c>] bio_alloc_bioset+0x89/0xd9 Aug 17 17:57:16 init2 kernel: [<ffffffff803c4d64>] md_super_wait+0xa3/0xb7 Aug 17 17:57:16 init2 kernel: [<ffffffff8023fbb4>] autoremove_wake_function+0x0/0x2e Aug 17 17:57:16 init2 kernel: [<ffffffff803cc017>] write_page+0x1bb/0x2a3 Aug 17 17:57:16 init2 kernel: [<ffffffff80224c00>] find_busiest_group+0x25e/0x6c6 Aug 17 17:57:16 init2 kernel: [<ffffffff803c7e9a>] super_90_sync+0x359/0x367 Aug 17 17:57:16 init2 kernel: [<ffffffff803c5461>] md_update_sb+0x1be/0x291 Aug 17 17:57:16 init2 kernel: [<ffffffff803c9841>] md_check_recovery+0x185/0x4c0 Aug 17 17:57:16 init2 kernel: [<ffffffff80242434>] hrtimer_start+0xfc/0x11e Aug 17 17:57:16 init2 kernel: [<ffffffffa0044091>] :raid1:raid1d+0x25/0xc68 Aug 17 17:57:16 init2 kernel: [<ffffffff80459e32>] thread_return+0x63/0xa4 Aug 17 17:57:16 init2 kernel: [<ffffffff80236d9f>] lock_timer_base+0x26/0x4b Aug 17 17:57:16 init2 kernel: [<ffffffff80236e15>] try_to_del_timer_sync+0x51/0x5a Aug 17 17:57:16 init2 kernel: [<ffffffff80236e2a>] del_timer_sync+0xc/0x16 Aug 17 17:57:16 init2 kernel: [<ffffffff8045a0ff>] schedule_timeout+0x92/0xad Aug 17 17:57:16 init2 kernel: [<ffffffff803cb2e9>] md_thread+0xd7/0xed Aug 17 17:57:16 init2 kernel: [<ffffffff8023fbb4>] autoremove_wake_function+0x0/0x2e Aug 17 17:57:16 init2 kernel: [<ffffffff803cb212>] md_thread+0x0/0xed Aug 17 17:57:16 init2 kernel: [<ffffffff8023fa95>] kthread+0x47/0x76 Aug 17 17:57:16 init2 kernel: [<ffffffff8022aa03>] schedule_tail+0x28/0x5d Aug 17 17:57:16 init2 kernel: [<ffffffff8020cb78>] child_rip+0xa/0x12 Aug 17 17:57:16 init2 kernel: [<ffffffff8023fa4e>] kthread+0x0/0x76 Aug 17 17:57:16 init2 kernel: [<ffffffff8020cb6e>] child_rip+0x0/0x12 Aug 17 17:57:16 init2 kernel: Aug 17 17:57:17 init2 kernel: qla2xxx 0000:0c:00.0: scsi(3:0:0): Abort command issued -- 1 3b849 2002. Aug 17 17:57:17 init2 kernel: end_request: I/O error, dev sdc, sector 1048575880 Aug 17 17:57:17 init2 kernel: md: super_written gets error=-5, uptodate=0 Aug 17 17:57:17 init2 kernel: raid1: Disk failure on sdc1, disabling device. Aug 17 17:57:17 init2 kernel: raid1: Operation continuing on 1 devices. Aug 17 17:57:17 init2 kernel: RAID1 conf printout: Aug 17 17:57:17 init2 kernel: --- wd:1 rd:2 Aug 17 17:57:17 init2 kernel: disk 0, wo:1, o:0, dev:sdc1 Aug 17 17:57:17 init2 kernel: disk 1, wo:0, o:1, dev:sdd1 Aug 17 17:57:17 init2 kernel: RAID1 conf printout: Aug 17 17:57:17 init2 kernel: --- wd:1 rd:2 Aug 17 17:57:17 init2 kernel: disk 1, wo:0, o:1, dev:sdd1 --------------------------------------------------------------------------- target -------------------------------------------------------------------- Aug 17 17:55:13 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2. Aug 17 17:55:16 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x000A): Drive error detected:unit=2, port=-2147483646. Aug 17 17:55:18 scst-target kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0009): Drive timeout detected:port=2. Aug 17 17:55:29 scst-target last message repeated 2 times Aug 17 17:55:32 scst-target kernel: sd 1:0:2:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card. Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 66331127 Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37108159 Aug 17 17:56:16 scst-target kernel: raid5: Disk failure on sdk1, disabling device. Operation continuing on 3 devices Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37109439 Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37108799 Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37108847 Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37109191 Aug 17 17:56:16 scst-target kernel: end_request: I/O error, dev sdk, sector 37109183 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108687 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108431 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108423 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37108207 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37106855 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107135 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107391 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107519 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107775 Aug 17 17:56:17 scst-target kernel: end_request: I/O error, dev sdk, sector 37107823 Aug 17 17:56:18 scst-target kernel: end_request: I/O error, dev sdk, sector 151465351 Aug 17 17:56:18 scst-target kernel: RAID5 conf printout: Aug 17 17:56:18 scst-target kernel: --- rd:4 wd:3 Aug 17 17:56:18 scst-target kernel: disk 0, o:1, dev:sdi1 Aug 17 17:56:18 scst-target kernel: disk 1, o:1, dev:sdj1 Aug 17 17:56:18 scst-target kernel: disk 2, o:0, dev:sdk1 Aug 17 17:56:18 scst-target kernel: disk 3, o:1, dev:sdl1 Aug 17 17:56:18 scst-target kernel: RAID5 conf printout: Aug 17 17:56:18 scst-target kernel: --- rd:4 wd:3 Aug 17 17:56:18 scst-target kernel: disk 0, o:1, dev:sdi1 Aug 17 17:56:18 scst-target kernel: disk 1, o:1, dev:sdj1 Aug 17 17:56:18 scst-target kernel: disk 3, o:1, dev:sdl1 Aug 17 19:37:16 scst-target sshd[4515]: refused connect from mon1 (::ffff:xxx.xx.xxx.xxx) --------------------------------------------------------------------------- As always, Dr. G.W. Wettstein, Ph.D. Enjellic Systems Development, LLC. 4206 N. 19th Ave. Specializing in information infra-structure Fargo, ND 58102 development. PH: 701-281-1686 FAX: 701-281-3949 EMAIL: greg@xxxxxxxxxxxx ------------------------------------------------------------------------------ "This patch causes a CONFIG_PREEMPT=y, CONFIG_PREEMPT_BKL=y, CONFIG_DEBUG_PREEMPT=y kernel on a ppc64 G5 to hang immediately after displaying the penguins, but apparently not before having set the hardware clock backwards 101 years." "After having carefully reviewed the above description and having decided that these effects were not a part of the patch's design intent I have temporarily set it aside, thanks." -- Andrew Morton linux-kernel -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html