Re: Failed path will not be recovered when disabling/enabling remote port

Konrad Rzeszutek <konrad@xxxxxxxxxxxxxxx> · Thu, 2 Jul 2009 09:06:19 -0400

On Thu, Jul 02, 2009 at 01:44:18PM +0200, Hannes Reinecke wrote:
> Christian May wrote:
> > Hi,
> > 
> > I've set up an IBM z10 LPAR (mainframe server) with a 2.6.30 kernel.
> > Attached to the System z10 was an IBM DS8000 storage server. 10x SCSI
> > LUNs were assigned to the LPAR via two paths:
> > 
> > Example:
> > 36005076303ffc1040000000000001269 dm-9 IBM,2107900
> > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> > `-+- policy='round-robin 0' prio=-2 status=active
> >  |- 0:0:0:1080639506 sdw   65:96  active undef running
> >  `- 1:0:1:1080639506 sdt   65:48  active undef running
> > 
> > Special parameter setting: dev_loss_tmo=90sec; fast_io_fail_tmo=5sec
> > 
> > multipath tools: multipath-tools v0.4.9 (04/04, 2009)
> > device-mapper: device-mapper-1.02.27-7.fc10.s390x,
> > device-mapper-libs-1.02.27-7.fc10.s390x
> > 
> > When removing a remote port (disabling a port on the BROCADE FC switch)
> > one path failed.
> > 
> > [root@h42lp26/ESAME:~]
> >> multipath -l
> > 36005076303ffc1040000000000001268 dm-8 ,
> > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> > `-+- policy='round-robin 0' prio=-2 status=active
> >  |- #:#:#:#          -     #:#   failed undef running
> >  `- 1:0:1:1080573970 sdr   65:16 active undef running
> > 
> > After a while (>90sec) SCSI LUNs were removed from system:
> > 
> [ .. ]
> > 
> > When re-enabling the path, the SCSI LUNs were reassigned to the system
> > but the path didn't recover:
> > 
> [ .. ]
> 
> > 
> > 
> > [root@h42lp26/ESAME:~]
> >> multipath -l
> > 36005076303ffc1040000000000001268 dm-8 ,
> > size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
> > `-+- policy='round-robin 0' prio=-2 status=active
> >  |- #:#:#:#          -     #:#    failed undef running
> >  `- 1:0:1:1080573970 sdr   65:16  active undef running
> > 
> > 
> > Running the "multipath" command will recover the failed path, but that's
> > not the way it should be... can somebody help to fix this? Why is the
> > path not recovered automatically?
> > 
> It should, really.
> 
> The problem is that the paths have _not_ been reconnected;
> the hashes indicate that the in-kernel multipath code references
> a device for which no information is available.
> And the new device has _not_ been reconnected, as otherwise
> you'd end up with _three_ paths here.
> 
> Probably missing udev integration.
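To spot this state from a script, the `multipath -l` listing can be
scanned for path lines whose H:C:T:L field is the `#:#:#:#` placeholder.
What follows is only a minimal Python sketch, assuming the output layout
shown in the report above (multipath-tools 0.4.9). The function name
find_orphaned_paths is hypothetical and the parsing is deliberately naive.

```python
import re

# Path lines look like " |- 0:0:0:1080639506 sdw 65:96 active undef running"
# or " |- #:#:#:# - #:# failed undef running" once the device info is gone.
# The "`-+- policy=..." group line does not match: no whitespace after "`-".
PATH_LINE = re.compile(r"^\s*[|`]-\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)")


def find_orphaned_paths(listing):
    """Return (map_wwid, dm_state) pairs for paths shown as '#:#:#:#'."""
    orphans = []
    current_map = None
    for line in listing.splitlines():
        m = PATH_LINE.match(line)
        if m:
            hctl, _dev, _majmin, dm_state = m.groups()
            if hctl == "#:#:#:#":
                orphans.append((current_map, dm_state))
        elif " dm-" in line:
            # Map header, e.g. "3600... dm-8 ,": the first field is the WWID.
            current_map = line.split()[0]
    return orphans
```

Fed the second listing from the report, this should return one entry for
map 36005076303ffc1040000000000001268 with state "failed". It only
detects the symptom; it says nothing about why the path was not re-added.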

Could also be a race condition that is present in the SLES10 and RHEL5
kernels: the SysFS directories are created (and the udev event is sent
out) before the kernel has populated them. So when multipathd tries to
read them it finds no pertinent information and shoves the path off to
the 'orphan' state.
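
The general idea behind working around such a race in userspace can be
sketched as: after the udev event arrives, poll the SysFS directory until
the attributes are readable and non-empty, instead of giving up on the
first read. The Python below is only an illustration of that idea. It is
not multipathd code and not the kernel-side fix; the function name, the
attribute list and the timeout values are assumptions.

```python
import os
import time


def wait_for_sysfs_attrs(dev_dir, attrs, timeout=5.0, interval=0.1):
    """Poll dev_dir until every attribute in attrs is readable and non-empty.

    Returns a dict of attribute values, or None if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        values = {}
        for name in attrs:
            try:
                with open(os.path.join(dev_dir, name)) as f:
                    data = f.read().strip()
            except OSError:
                data = ""
            if not data:
                break  # not populated yet, retry after a short sleep
            values[name] = data
        else:
            return values  # every attribute was present and non-empty
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

A caller might use wait_for_sysfs_attrs("/sys/block/sdr/device",
["vendor", "model", "state"]) and only treat the path as unusable when it
returns None. Which attributes actually matter here is an assumption.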

I did post a patch for this a while back. Granted this isn't a problem
with the more recent kernels.
> 
> I really have to push my patches upstream ... sigh.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		      zSeries & Storage
> hare@xxxxxxx			      +49 911 74053 688
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Markus Rex, HRB 16746 (AG Nürnberg)
> 
> --
> dm-devel mailing list
> dm-devel@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/dm-devel
