[[Sorry for the dup message; I used the wrong address on the last one]]

On Wed, 8 Jul 2009 at 22:44:21 +0200 (CEST), Christophe Varoqui wrote:
> Mike Christie is working on a patchset to let target errors have the
> scsi-layer retries treatment while retaining the failfast behaviour
> for transport errors.
>
> This work should benefit your problem too, if I'm not mistaken.

Thanks for the response Christophe... just to be clear: you're saying
that my situation is known not to work in current Linux, and is not
supported until this new work by Mike is ready?  IOW, this is not a bug
but rather unimplemented functionality?

> On Tue, 2009-07-07 at 17:26 -0400, Paul Smith wrote:
> > I have a configuration that has raid1 mirrors (md_raid1) built on top
> > of linear segments of multipath'd scsi disks (dm-multipath).  This is
> > Linux 2.6.27.25, FYI.  Unfortunately, because this is an embedded
> > environment, it's not easy for us to jump to a newer kernel.
> >
> > In this configuration, when a scsi disk reports a media failure (SCSI
> > Sense/ASC/ASCQ: 3/11/0), I would expect md_raid1 to be able to handle
> > the error: read the data from the other mirror and then re-write the
> > failed sector on the original image.
> >
> > I have tried this with the no_path_retry attribute set to 'fail' and
> > observe the following:
> >
> > dm-multipath reports the path failure.
> > Then it tries the request on the other path, which also gets a path
> > failure.
> > The failure of both paths fails the device.
> >
> > md-raid1 gets the error and reads from the other mirror.
> > When md-raid1 tries to re-write the data, it encounters the failed
> > device.
> >
> > From syslog:
> >
> > Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] Result: hostbyte=0x00 driverbyte=0x08
> > Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] Sense Key : 0x3 [current]
> > Jul 1 20:46:27 hostname user.warn kernel: Info fld=0x217795c
> > Jul 1 20:46:27 hostname user.info kernel: sd 3:0:3:0: [sdaz] ASC=0x11 ASCQ=0x0
> > Jul 1 20:46:27 hostname user.warn kernel: device-mapper: multipath: Failing path 67:48.
> > Jul 1 20:46:27 hostname daemon.notice multipathd: 67:48: mark as failed
> > Jul 1 20:46:27 hostname daemon.notice multipathd: encl3Slot4: remaining active paths: 1
> > Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] Result: hostbyte=0x00 driverbyte=0x08
> > Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] Sense Key : 0x3 [current]
> > Jul 1 20:46:29 hostname user.warn kernel: Info fld=0x217795c
> > Jul 1 20:46:29 hostname user.info kernel: sd 2:0:29:0: [sdab] ASC=0x11 ASCQ=0x0
> > Jul 1 20:46:29 hostname user.warn kernel: device-mapper: multipath: Failing path 65:176.
> > Jul 1 20:46:30 hostname daemon.notice multipathd: 65:176: mark as failed
> > Jul 1 20:46:30 hostname daemon.notice multipathd: encl3Slot4: remaining active paths: 0
> > Jul 1 20:46:30 hostname user.err kernel: raid1: dm-36: rescheduling sector 35092739
> > Jul 1 20:46:30 hostname user.alert kernel: raid1: Disk failure on dm-38, disabling device.
> > Jul 1 20:46:30 hostname user.alert kernel: raid1: Operation continuing on 1 devices.
> > Jul 1 20:46:30 hostname user.warn kernel: md: super_written gets error=-5, uptodate=0
> > Jul 1 20:46:30 hostname user.alert kernel: raid1: Disk failure on dm-36, disabling device.
> > Jul 1 20:46:30 hostname user.alert kernel: raid1: Operation continuing on 1 devices.
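(For context, the multipath configuration behind these maps looks roughly
like the sketch below.  This is a minimal reconstruction, not a copy of our
actual file: the wwid is a placeholder and the exact section layout may
differ, but the alias, the readsector0 checker and the no_path_retry
setting match what shows up in the logs above.)

    # /etc/multipath.conf (relevant parts only)
    #
    # Stack, top to bottom (names as they appear in the logs):
    #   md raid1          (members dm-36 / dm-38)
    #   dm-linear segments
    #   dm-multipath      (maps such as encl3Slot4)
    #   sd paths          (sdaz, sdab, ...)

    defaults {
            path_checker    readsector0
    }

    multipaths {
            multipath {
                    wwid            <wwid>          # placeholder, not the real value
                    alias           encl3Slot4
                    no_path_retry   fail            # fail I/O immediately when all paths are down
            }
    }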
> > When the no_path_retry attribute is set to '3':
> >
> > dm-multipath reports the path failure.
> > Then retries the request on the other path, which also gets a path
> > failure.
> > On the failure of the second path, the device queues and enters
> > recovery mode.
> > On the subsequent poll of the paths, they are reinstated and the IOs
> > are 'resumed'...
> > ... and of course fail with the media error again...
> > causing a hang...
> >
> > From syslog:
> >
> > Jul 7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] Result: hostbyte=0x00 driverbyte=0x08
> > Jul 7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] Sense Key : 0x3 [current]
> > Jul 7 10:54:35 hostname user.warn kernel: Info fld=0x123c016d
> > Jul 7 10:54:35 hostname user.info kernel: sd 2:0:19:0: [sds] ASC=0x11 ASCQ=0x0
> > Jul 7 10:54:35 hostname user.warn kernel: device-mapper: multipath: Failing path 65:32.
> > Jul 7 10:54:35 hostname daemon.notice multipathd: 65:32: mark as failed
> > Jul 7 10:54:35 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
> > Jul 7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm]
> > Jul 7 10:54:37 hostname user.info kernel: Result: hostbyte=0x00 driverbyte=0x08
> > Jul 7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm] Sense Key : 0x3
> > Jul 7 10:54:37 hostname user.info kernel: [current]
> > Jul 7 10:54:37 hostname user.warn kernel: Info fld=0x123c016d
> > Jul 7 10:54:37 hostname user.info kernel: sd 3:0:45:0: [sdcm]
> > Jul 7 10:54:37 hostname user.info kernel: ASC=0x11 ASCQ=0x0
> > Jul 7 10:54:37 hostname user.warn kernel: device-mapper: multipath: Failing path 69:160.
> > Jul 7 10:54:37 hostname daemon.notice multipathd: 69:160: mark as failed
> > Jul 7 10:54:37 hostname daemon.warn multipathd: encl2Slot7: Entering recovery mode: max_retries=3
> > Jul 7 10:54:37 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 0
> > Jul 7 10:54:39 hostname daemon.warn multipathd: sds: readsector0 checker reports path is up
> > Jul 7 10:54:39 hostname daemon.notice multipathd: 65:32: reinstated
> > Jul 7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: queue_if_no_path enabled
> > Jul 7 10:54:42 hostname daemon.warn multipathd: encl2Slot7: Recovered to normal mode
> > Jul 7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
> > Jul 7 10:54:42 hostname daemon.warn multipathd: sdcm: readsector0 checker reports path is up
> > Jul 7 10:54:42 hostname daemon.notice multipathd: 69:160: reinstated
> > Jul 7 10:54:42 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 2
> > Jul 7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds] Result: hostbyte=0x00 driverbyte=0x08
> > Jul 7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds]
> > Jul 7 10:54:42 hostname user.info kernel: Sense Key : 0x3
> > Jul 7 10:54:42 hostname user.info kernel: [current]
> > Jul 7 10:54:42 hostname user.info kernel:
> > Jul 7 10:54:42 hostname user.warn kernel: Info fld=0x123c016d
> > Jul 7 10:54:42 hostname user.info kernel: sd 2:0:19:0: [sds]
> > Jul 7 10:54:42 hostname user.info kernel: ASC=0x11 ASCQ=0x0
> > Jul 7 10:54:42 hostname user.warn kernel: device-mapper: multipath: Failing path 65:32.
> > Jul 7 10:54:43 hostname daemon.notice multipathd: 65:32: mark as failed
> > Jul 7 10:54:43 hostname daemon.notice multipathd: encl2Slot7: remaining active paths: 1
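(For the second test, the only change from the sketch above is the retry
count; again this is just a sketch, with the wwid elided:)

    multipaths {
            multipath {
                    wwid            <wwid>          # placeholder, not the real value
                    alias           encl2Slot7
                    no_path_retry   3               # queue I/O and retry for 3 checker intervals
            }
    }

That retry count is what multipathd reports above as "Entering recovery
mode: max_retries=3".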
> > Should dm-multipath distinguish media failures from actual device
> > errors?
> >
> > Is there a different no_path_retry policy that would fail this request
> > but queue subsequent requests?

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel