Good morning, I hope the end of the week is going well for everyone. Apologies for the rather wide distribution on this note, but I wanted to make sure all the involved parties were in the loop.

We have been chasing a series of anomalies in a large production SAN environment involving MD/RAID1 and the sysfs/kobject system. I was able to get full instrumentation and logging on the issues during a systems failure early this morning and wanted to get a report out on what was found.

The SAN initiators use RAID1 mirrors to access SAN targets in two geographically isolated data centers. We have been having issues with SAN targets periodically failing, which presents as I/Os that never complete back to the RAID1 driver. As has been discussed earlier on linux-raid, this is an error scenario which the RAID1 driver cannot address, short of a timer-based solution which arguably involves a layering violation. As processes on the initiators attempt I/O they go into D state, which eventually produces very high load levels. Our response is to restart the hung target. This generates a LIP in the affected zone, which the HBAs of course detect, which in turn results in the underlying device being failed out of the mirror.

For completeness' sake, and for anyone Googling this issue: the RHEL5.1 kernel (2.6.18-53.1.13.el5) gets the situation completely wrong. The RAID1 driver detects the anomaly, kicks the device and indicates it is continuing on one device, but then throws an I/O error up through LVM and into the overlying filesystem. This causes the filesystem to go read-only, in some cases not before generating filesystem corruption which has to be corrected by a full ext3 filesystem check.

To make this somewhat more relevant to the kernel developers, a couple of initiators were re-platformed on the most recent 2.6.22/2.6.23 kernels. I know even these kernels are dated, but these are full production systems, which has kept us off 2.6.24 for the time being.
The 2.6.22.x/2.6.23.x kernels see the same I/O stalls when the targets hang, but they do handle the failure scenario correctly. The MD device picks up on the LIP induced by the target reboot, fails out the device, and the machines go forward as they should.

On these kernels we do see a problem which I'm interpreting as secondary to lifetime issues between the MD driver and kobject/sysfs. Here is an excerpt from logging at 'info' and 'notice' priority levels for the event:

---------------------------------------------------------------------------
Feb 22 01:36:54 rsg1 kernel: qla2xxx 0000:03:09.0: scsi(3:0:1): Abort command issued -- 1 3818fa 2002.
Feb 22 01:37:20 rsg1 hotplug: Event 871 requested remove for device: scsi_disk
Feb 22 01:37:20 rsg1 hotplug: Event 870 requested remove for device: scsi_device
Feb 22 01:37:20 rsg1 hotplug: Event 872 requested remove for device: block
Feb 22 01:37:20 rsg1 hotplug: Event 873 requested remove for device: block
Feb 22 01:37:20 rsg1 hotplug: Event 874 requested remove for device: scsi
Feb 22 01:37:24 rsg1 kernel: qla2xxx 0000:03:09.0: scsi(3:0:1): Abort command issued -- 1 3818fb 2002.
Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: scsi: Device offlined - not ready after error recovery
Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Feb 22 01:39:24 rsg1 kernel: scsi scan: INQUIRY result too short (5), using 36
Feb 22 01:39:24 rsg1 kernel: scsi 3:0:0:1: Direct-Access FCTARGET lvm(58,6) 0.9 PQ: 0 ANSI: 3
Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] 418775040 512-byte hardware sectors (214413 MB)
Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] Write Protect is off
Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Feb 22 01:39:24 rsg1 hotplug: Event 876 requested add for device: scsi_disk
Feb 22 01:39:24 rsg1 hotplug: Event 875 requested add for device: scsi
Feb 22 01:39:24 rsg1 kernel: sde: sde1
Feb 22 01:39:24 rsg1 hotplug: Event 877 requested add for device: block
Feb 22 01:39:24 rsg1 hotplug: Event 878 requested add for device: block
Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] Attached SCSI disk
---------------------------------------------------------------------------

Which corresponds to the following messages at 'warn' priority:

---------------------------------------------------------------------------
Feb 22 01:37:20 rsg1 kernel: rport-3:0-1: blocked FC remote port time out: removing target and saving binding
Feb 22 01:37:24 rsg1 kernel: end_request: I/O error, dev sdd, sector 418774920
Feb 22 01:37:24 rsg1 kernel: md: super_written gets error=-5, uptodate=0
Feb 22 01:37:24 rsg1 kernel: raid1: Disk failure on sdd1, disabling device.
Feb 22 01:37:24 rsg1 kernel: ^IOperation continuing on 1 devices
Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: rejecting I/O to dead device
Feb 22 01:37:24 rsg1 kernel: md: super_written gets error=-5, uptodate=0
Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: rejecting I/O to dead device
Feb 22 01:37:24 rsg1 kernel: md: super_written gets error=-5, uptodate=0
Feb 22 01:37:24 rsg1 kernel: RAID1 conf printout:
Feb 22 01:37:24 rsg1 kernel:  --- wd:1 rd:2
Feb 22 01:37:24 rsg1 kernel:  disk 0, wo:0, o:1, dev:sdc1
Feb 22 01:37:24 rsg1 kernel:  disk 1, wo:1, o:0, dev:sdd1
Feb 22 01:37:24 rsg1 kernel: RAID1 conf printout:
Feb 22 01:37:24 rsg1 kernel:  --- wd:1 rd:2
Feb 22 01:37:24 rsg1 kernel:  disk 0, wo:0, o:1, dev:sdc1
Feb 22 01:39:26 rsg1 kernel: kobject_add failed for 3:0:0:1 with -EEXIST, don't try to register things with the same name in the same directory.
Feb 22 01:39:26 rsg1 kernel: [<c01bddde>] kobject_shadow_add+0x101/0x10a
Feb 22 01:39:26 rsg1 kernel: [<c01edc5d>] device_add+0x7c/0x374
Feb 22 01:39:26 rsg1 kernel: [<c020cb47>] scsi_sysfs_add_sdev+0x27/0x148
Feb 22 01:39:26 rsg1 kernel: [<c020b09f>] scsi_add_lun+0x2b7/0x2cc
Feb 22 01:39:26 rsg1 kernel: [<c020b228>] scsi_probe_and_add_lun+0x174/0x206
Feb 22 01:39:26 rsg1 kernel: [<c020b37d>] scsi_sequential_lun_scan+0xc3/0xda
Feb 22 01:39:26 rsg1 kernel: [<c020b98e>] __scsi_scan_target+0xd1/0xe8
Feb 22 01:39:26 rsg1 kernel: [<c020ba4d>] scsi_scan_target+0xa8/0xc2
Feb 22 01:39:26 rsg1 kernel: [<c0213529>] fc_scsi_scan_rport+0x0/0x73
Feb 22 01:39:26 rsg1 kernel: [<c021357e>] fc_scsi_scan_rport+0x55/0x73
Feb 22 01:39:26 rsg1 kernel: [<c01227f9>] run_workqueue+0x77/0xf3
Feb 22 01:39:26 rsg1 kernel: [<c0122875>] worker_thread+0x0/0xb1
Feb 22 01:39:26 rsg1 kernel: [<c012291c>] worker_thread+0xa7/0xb1
Feb 22 01:39:26 rsg1 kernel: [<c01256e9>] autoremove_wake_function+0x0/0x33
Feb 22 01:39:26 rsg1 kernel: [<c01256e9>] autoremove_wake_function+0x0/0x33
Feb 22 01:39:26 rsg1 kernel: [<c0122875>] worker_thread+0x0/0xb1
Feb 22 01:39:26 rsg1 kernel: [<c0125307>] kthread+0x34/0x55
Feb 22 01:39:26 rsg1 kernel: [<c01252d3>] kthread+0x0/0x55
Feb 22 01:39:26 rsg1 kernel: [<c01030c7>] kernel_thread_helper+0x7/0x10
Feb 22 01:39:26 rsg1 kernel: =======================
---------------------------------------------------------------------------

As can be seen from the first set of logs, the /dev/sdd device is what gets kicked out of the RAID1 device. The SAN target coming back on-line triggers an event which results in the same device being re-discovered as /dev/sde. My assumption is that discovery is forced onto the 'next' SCSI device name because the RAID1 MD device is still holding a reference to the failed /dev/sdd device, thus preventing its re-use. That is at least how I interpret the:

Feb 22 01:39:26 rsg1 kernel: kobject_add failed for 3:0:0:1 with -EEXIST, don't try to register things with the same name in the same directory.

And the resultant stack trace.

The interaction of this with udev tends to complicate the situation from a systems administration point of view. The loss of the /dev/sdd device is picked up by udevd, which causes the node entries for the block device and its associated partitions to be removed. This violates the principle of 'least surprise' when the systems staff log in to recover the situation: they cannot remove the failed device from the RAID1 device and are potentially forced to add a completely different device back into the MD device to recover from the event.

Neil/Greg, our production SUSE boxes running the 2.6.16.54-0.2.5-smp kernel get all of this completely right. They handle the device failure and don't end up re-discovering the target as the next SCSI device. If the issues we are seeing are due to a change in block device or sysfs/kobject behavior since 2.6.16, future SUSE releases may be affected. If the behavior is due to better udev rules, you guys win... :-)
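As an aside, the manual cleanup amounts to figuring out which MD array is still pinning the failed member. A minimal sketch of the mdstat-parsing side of such a hotplug helper follows; it is illustrative only, not something we have deployed. The member-token format assumed here ("sdd1[1]" with an optional "(F)" failure flag) and the field positions are assumptions about the usual /proc/mdstat layout:

```shell
#!/bin/sh
# Sketch of a hotplug helper: given the kernel name of a removed disk
# (e.g. "sdd"), parse /proc/mdstat-formatted input on stdin and print
# "array member" pairs for every MD member living on that disk.
find_md_members() {
    disk="$1"
    awk -v d="$disk" '
        $1 ~ /^md/ {
            md = $1
            # Member tokens start at field 5:
            #   "md0 : active raid1 sdd1[1](F) sdc1[0]"
            for (i = 5; i <= NF; i++) {
                split($i, a, "[")
                if (a[1] ~ "^" d "[0-9]*$") print md, a[1]
            }
        }'
}

# A hotplug/udev remove event would then drive something like:
#   find_md_members "$DEVNAME" < /proc/mdstat |
#   while read md member; do
#       mdadm "/dev/$md" --fail "/dev/$member" --remove "/dev/$member"
#   done
```

The mdadm invocation in the trailing comment is the obvious intervention, but as noted above, by the time the script runs udev may already have removed the /dev node, which is exactly the wrinkle a DEVICE=md style event would avoid.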
Is it correct to assume that in the 'do everything in userspace' world the way to deal with this is to trap device removal with hotplug scripts and automatically remove the failed component from whatever MD device it may be in? If so, I don't currently see any support for this in current RHEL5.1 or SUSE userspace.

If the answer is yes, it would seem much easier to have some type of DEVICE=md hotplug event triggered in order to more precisely target the needed intervention. Currently the only solution seems to be to grope through all the configured MD devices to try and figure out what to do with a device removal event.

Al, sorry to force you to wade through this, but if my analysis is correct you can add this as fodder to your sysfs locking/lifetime rants... :-)

Once again I apologize for bringing this up against 'old' kernels, but since we needed to instrument this on production systems we were somewhat limited in our choices. If all this has been fixed in 2.6.2[45] let me know, since we would obviously want to use those kernels as replacements on the RHEL5.1 platforms.

Let me know if additional information or clarification is needed. Best wishes for a pleasant weekend.

As always,
Dr. G.W. Wettstein, Ph.D.       Enjellic Systems Development, LLC.
4206 N. 19th Ave.               Specializing in information infra-structure
Fargo, ND  58102                development.
PH: 701-281-1686
FAX: 701-281-3949               EMAIL: greg@xxxxxxxxxxxx
------------------------------------------------------------------------------
"Given a choice between a complex, difficult-to-understand, disconcerting
 explanation and a simplistic, comforting one, many prefer simplistic
 comfort if it's remotely plausible, especially if it involves blaming
 someone else for their problems."
                                -- Bob Lewis
                                   _Infoworld_