Re: Media failures cause Path/Device Failures in dm-multipath ?

christophe.varoqui@xxxxxxx · Wed, 8 Jul 2009 22:44:21 +0200 (CEST)

Mike Christie is working on a patchset that lets target errors get the SCSI-layer retry treatment while retaining the failfast behaviour for transport errors.

This work should help with your problem too, if I'm not mistaken.
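
As a rough illustration of the target/transport distinction, here is a minimal Python sketch. It is not the patchset and not kernel code: the function names and the policy (a non-zero hostbyte counts as a transport error, sense key 0x3 MEDIUM ERROR counts as a target error) are assumptions made for this example. It only reads the "Result:" and "Sense Key" lines in the format shown in the syslog excerpts quoted below.

```python
import re

# SCSI sense key 0x3 is MEDIUM ERROR (the key reported in the quoted logs).
MEDIUM_ERROR = 0x3

# Patterns for the two kernel log lines this sketch looks at.
RESULT_RE = re.compile(r"hostbyte=(0x[0-9a-fA-F]+) driverbyte=(0x[0-9a-fA-F]+)")
SENSE_RE = re.compile(r"Sense Key : (0x[0-9a-fA-F]+)")


def classify(hostbyte, sense_key):
    """Hypothetical policy, for illustration only.

    A non-zero hostbyte means the host/transport reported the problem,
    so trying another path may help. A zero hostbyte with a MEDIUM ERROR
    sense key means the target itself reported the problem, so another
    path to the same disk is expected to hit the same error.
    """
    if hostbyte != 0:
        return "transport"
    if sense_key == MEDIUM_ERROR:
        return "target"
    return "unknown"


def classify_log(lines):
    """Extract hostbyte and sense key from log lines and classify them."""
    hostbyte = None
    sense_key = None
    for line in lines:
        m = RESULT_RE.search(line)
        if m:
            hostbyte = int(m.group(1), 16)
        m = SENSE_RE.search(line)
        if m:
            sense_key = int(m.group(1), 16)
    if hostbyte is None:
        return "unknown"
    return classify(hostbyte, sense_key)
```

Fed the two [sdaz] lines from the first excerpt (hostbyte=0x00, Sense Key : 0x3), classify_log() returns "target", which under this assumed policy is the case where failing the path does not help.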

Regards,
cvaroqui

----- Original Message -----
From: "Paul Smith" <paul@xxxxxxxxxxxxxxxxx>
To: "device-mapper development" <dm-devel@xxxxxxxxxx>
Sent: Wednesday 8 July 2009 21:34:56 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: Media failures cause Path/Device Failures in dm-multipath ?

Hi all; does anyone have any thoughts about/comments on this?

It's kind of not so useful to be using this stuff if I can't recover
from a media failure, after all... what am I doing wrong?

Thanks!

On Tue, 2009-07-07 at 17:26 -0400, Paul Smith wrote:
> I have a configuration that has raid1 mirrors (md_raid1) built on top of
> linear segments of multipath'd scsi disks (dm-multipath).  This is Linux
> 2.6.27.25, FYI.  Unfortunately because this is an embedded environment
> it's not easy for us to jump to a newer kernel.
> 
> In this configuration when a scsi disk reports a media failure (SCSI
> Sense/ASC/ASCQ: 3/11/0), I would expect that the md_raid1 would be able
> to handle the error and read the data from the other mirror and then
> re-write the failed sector on the original image.
> 
> I have tried this with the no_path_retry attribute as 'fail'  and
> observe the following... 
> 
> dm-multipath reports the path failure. 
> Then it tries the request on the other path, which also gets a path
> failure.
> The failure of both paths fails the device.
> 
> md-raid1 gets the error and reads from the other mirror.
> When md-raid1 tries to re-write the data it encounters the failed
> device.
> 
> From syslog:
> 
> Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] Result: hostbyte=0x00 driverbyte=0x08
> Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] Sense Key : 0x3 [current]
> Jul 1 20:46:27 hostname user.warn kernel: Info fld=0x217795c
> Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] ASC=0x11 ASCQ=0x0
> Jul 1 20:46:27 hostname user.warn kernel: device-mapper: multipath: Failing path 67:48.
> Jul 1 20:46:27 hostname daemon.notice multipathd: 67:48: mark as failed
> Jul 1 20:46:27 hostname daemon.notice multipathd: encl3Slot4: remaining active paths: 1
> Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] Result: hostbyte=0x00 driverbyte=0x08
> Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] Sense Key : 0x3 [current]
> Jul 1 20:46:29 hostname user.warn kernel: Info fld=0x217795c
> Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] ASC=0x11 ASCQ=0x0
> Jul 1 20:46:29 hostname user.warn kernel: device-mapper: multipath: Failing path 65:176.
> Jul 1 20:46:30 hostname daemon.notice multipathd: 65:176: mark as failed
> Jul 1 20:46:30 hostname daemon.notice multipathd: encl3Slot4: remaining active paths: 0
> Jul 1 20:46:30 hostname user.err kernel: raid1: dm-36: rescheduling sector 35092739
> Jul 1 20:46:30 hostname user.alert kernel: raid1: Disk failure on dm-38, disabling device.
> Jul 1 20:46:30 hostname user.alert kernel: raid1: Operation continuing on 1 devices.
> Jul 1 20:46:30 hostname user.warn kernel: md: super_written gets error=-5, uptodate=0
> Jul 1 20:46:30 hostname user.alert kernel: raid1: Disk failure on dm-36, disabling device.
> Jul 1 20:46:30 hostname user.alert kernel: raid1: Operation continuing on 1 devices.
> 
> When the no_path_retry attribute is set to '3' :
> 
> dm-multipath reports the path failure.
> Then retries the request on the other path, which also gets a path
> failure.
> On the failure of the second path, the device queues, and enters
> recovery mode.
> On the subsequent poll of the paths they are reinstated and the IOs are
> 'resumed'.....
> ... and of course fail with the media error again.......
> causing a hang...
> 
> From syslog:
> 
> Jul  7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] Result: hostbyte=0x00 driverbyte=0x08
> Jul  7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] Sense Key : 0x3 [current]
> Jul  7 10:54:35 hostname user.warn kernel: Info fld=0x123c016d
> Jul  7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] ASC=0x11 ASCQ=0x0
> Jul  7 10:54:35 hostname user.warn kernel: device-mapper: multipath: Failing path 65:32.
> Jul  7 10:54:35 hostname daemon.notice multipathd: 65:32: mark as failed
> Jul  7 10:54:35 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
> Jul  7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm]
> Jul  7 10:54:37 hostname user.info kernel: Result: hostbyte=0x00 driverbyte=0x08
> Jul  7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm] Sense Key : 0x3
> Jul  7 10:54:37 hostname user.info kernel: [current]
> Jul  7 10:54:37 hostname user.warn kernel: Info fld=0x123c016d
> Jul  7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm]
> Jul  7 10:54:37 hostname user.info kernel: ASC=0x11 ASCQ=0x0
> Jul  7 10:54:37 hostname user.warn kernel: device-mapper: multipath: Failing path 69:160.
> Jul  7 10:54:37 hostname daemon.notice multipathd: 69:160: mark as failed
> Jul  7 10:54:37 hostname daemon.warn multipathd: encl2Slot7: Entering recovery mode: max_retries=3
> Jul  7 10:54:37 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 0
> Jul  7 10:54:39 hostname daemon.warn multipathd: sds: readsector0 checker reports path is up
> Jul  7 10:54:39 hostname daemon.notice multipathd: 65:32: reinstated
> Jul  7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: queue_if_no_path enabled
> Jul  7 10:54:42 hostname daemon.warn multipathd: encl2Slot7: Recovered to normal mode
> Jul  7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
> Jul  7 10:54:42 hostname daemon.warn multipathd: sdcm: readsector0 checker reports path is up
> Jul  7 10:54:42 hostname daemon.notice multipathd: 69:160: reinstated
> Jul  7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 2
> Jul  7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds] Result: hostbyte=0x00 driverbyte=0x08
> Jul  7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds]
> Jul  7 10:54:42 hostname user.info kernel: Sense Key : 0x3
> Jul  7 10:54:42 hostname user.info kernel: [current]
> Jul  7 10:54:42 hostname user.info kernel:
> Jul  7 10:54:42 hostname user.warn kernel: Info fld=0x123c016d
> Jul  7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds]
> Jul  7 10:54:42 hostname user.info kernel: ASC=0x11 ASCQ=0x0
> Jul  7 10:54:42 hostname user.warn kernel: device-mapper: multipath: Failing path 65:32.
> Jul  7 10:54:43 hostname daemon.notice multipathd: 65:32: mark as failed
> Jul  7 10:54:43 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
> 
> 
> Should dm-multipath distinguish media failures from actual device
> errors?
> 
> Is there a different no_path_retry policy that would fail this request
> but queue subsequent requests?
> 
> 
> --
> dm-devel mailing list
> dm-devel@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
