Re: Failed path will not be recovered when disabling/enabling remote port

Hannes Reinecke <hare@xxxxxxx> · Tue, 21 Jul 2009 08:19:53 +0200

Hi Konrad,

Konrad Rzeszutek wrote:
>>> Could also be a race condition that is present in SLES10 + RHEL5
>>> kernels, where the SysFS directories are created (and the udev event is
>>> sent out), but the kernel hasn't populated the SysFS directories. So
>>> when multipathd tries to read them it finds no pertinent information and
>>> shoves it off to the 'orphan' state.
>>>
>> Really? With SLES10? Have you actually observed this?
> 
> With SLES10 SP2 to be exact. It wasn't an issue with SLES10 since the
> initial patch was there. The equipment I used to test this was an
> AX150FC with failed batteries (so no cache writes) and with a failed
> controller so it would run extra slow.
> 
>> We're running multipath _after_ udev has processed the event.
> 
> Right, the one where the SysFS directory is created. Then multipathd
> reads the data. I remember posting it here and mentioning that this
> problem exists on SLES10SP2 and RHEL5 but not on the upstream kernels.
> 
>> And udev already waited for sysfs, so we should be safe there.
> 
> Not so. udev gets the SCSI creation uevent, creates /dev/sdX, and so
> on. But the kernel hasn't yet fully populated the SysFS entries (so
> /sys/block/sdX/device/vendor does exist, but has no data in it).
>> It might be applicable to mainline multipath-tools, but
> 
> It really depends on how the SysFS directories are populated and how
> slow the SCSI target is.
> 
>> the SLES10 one ... I'd be surprised.
>>
>> Well, reasonably surprised. multipath keeps on throwing
>> an amazing number of issues still.
>>
>> Do you have more information here?
> 
> Here is the patch along with a detailed description.
> 
> The "multipath-tools-add-wait" patch is a backport/write of the
> wait_for_file routine used in the sysfs_get_[vendor|model|rev]
> macros. The SLES10 SP2 back-ported a lot of the upstream features
> of multipath, and one of those was getting rid of this function.
> I haven't yet found out the reason why it was deleted - it looks
> like a mistake, as the upstream kernel _should_ cause the same
> set of problems with multipath.
> [update: Upstream kernel has this fixed]
> 
> The reason a wait is necessary is due to the way the kernel
> sends the event. When a SCSI device is added the SCSI subsystem
> pursues this path:
> 
> _sysfs_add_sdev:
> 	calls device_add ...
>  	[ '/devices/platform/host16/session6/target16:0:0/16:0:0:17'] uevent
> 		bus_attach_device
> 			bus_for_each_drv
> 				driver_probe_device
> 					sd_probe
>  					['/class/scsi_disk/16:0:0:17' ] uevent
> 						add_disk
>  						['/block/sdai'] [ Here multipath starts its job ]
> 
> 	calls class_device_add ...
> 		[ '/class/scsi_device/16:0:0:17' ] uevent
> 		sg_add:
> 			[ '/class/scsi_generic/sg35' ] uevent
> 
> 
> 	done with device_add, and now we add the  attributes:
> 	--> scsi_sysfs_sdev_attrs[i].vendor, model, rev <-- THIS is the
> problem.
> 
> [Multipathd has started analyzing the data at the 'block/sdai' event. It
> reads the SysFS, but 'vendor' and 'model' have no data yet, so multipathd
> discards them and orphans the devices. That data gets to be there once
> 'device_add' is finished.]
> 
Ah. Hmm. Seems you are correct.

I'll have to apply the patch, then.
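
For the archives, the idea behind such a wait can be sketched roughly as
below. This is only an illustration and not the wait_for_file code from
the patch: the name wait_for_attr, its parameters and the retry/delay
scheme are all made up here. It simply re-reads a sysfs attribute until
it holds data or the retries run out.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not part of multipath-tools): read the first line
 * of 'path' into 'buf', retrying while the file is missing or still empty.
 * Returns 0 once a non-empty value was read, -1 if all retries failed.
 */
static int wait_for_attr(const char *path, char *buf, size_t len,
                         int retries, long delay_ms)
{
    struct timespec delay = { delay_ms / 1000,
                              (delay_ms % 1000) * 1000000L };

    assert(path != NULL && buf != NULL && len > 1);

    for (int i = 0; i < retries; i++) {
        FILE *f = fopen(path, "r");

        if (f != NULL) {
            char *got = fgets(buf, (int)len, f);

            fclose(f);
            if (got != NULL) {
                /* strip the trailing newline sysfs appends */
                buf[strcspn(buf, "\r\n")] = '\0';
                if (buf[0] != '\0')
                    return 0;   /* attribute is populated */
            }
        }
        /* attribute not there yet; back off before the next attempt */
        if (i + 1 < retries)
            nanosleep(&delay, NULL);
    }
    buf[0] = '\0';
    return -1;
}
```

A caller would then do something along the lines of
wait_for_attr("/sys/block/sdai/device/vendor", vendor, sizeof(vendor), 5, 100)
and only orphan the path if that returns -1. The retry count and delay
are arbitrary values for illustration.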

Fancy opening a bugzilla for it?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel