Hi Alasdair,
We are seeing an I/O error on a DM device when the HBA ports of another host, seen through the same switch, are disabled/enabled. We do not understand why the paths are failed when ports on other hosts are disabled. Please explain.
Below is the problem description and steps to reproduce.
Problem : I/O Error on DM device on one host when HBA ports of another host are disabled.
OS distros : RHEL4.0 U2/U3.
HOW-TO reproduce the problem :
1. Configure 2 storage arrays (A1, A2) and two hosts (H1, H2) in the same zone, so that both hosts can see both arrays. Create and present LUNs (L1, L2) from array (A1) to host (H1).
2. Stop the multipathd daemon (to investigate why the I/O error occurs when ports of other hosts are failed). If it is left running, the problem may take a long time to reproduce.
3. Start I/O on the DM device representing LUNs L1 and L2 on host H1. We used the dt tool to exercise I/O.
4. Disable the host ports of host H2, or any port of array A2, one after the other (a few times), OR disable and enable the same port of the other host a few times (perhaps 4-5 times).
5. The application (dt tool) aborts with an I/O error on host H1.
=====
Snippet of syslog output (while doing I/O on /dev/dm-0)
Feb 1 11:47:14 apwtest52 kernel: SCSI error : <2 0 0 1> return code = 0x20000
Feb 1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda, sector 1584600
Feb 1 11:47:14 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:0. <=================path failed, after disabling/enabling the H2 host port 1
Feb 1 11:47:14 apwtest52 kernel: end_request: I/O error, dev sda, sector 1584608
Feb 1 11:47:45 apwtest52 kernel: SCSI error : <3 0 1 1> return code = 0x20000
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg, sector 861400
Feb 1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:96. <=================path failed, after disabling/enabling the H2 host port 2
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sdg, sector 861408
Feb 1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 452760
Feb 1 11:47:45 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:64. <=================path failed after disabling/enabling the H2 host port 1
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 452768
Feb 1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 453784
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 453792
Feb 1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 454808
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 454816
Feb 1 11:47:45 apwtest52 kernel: SCSI error : <3 0 0 1> return code = 0x20000
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 863960
Feb 1 11:47:45 apwtest52 kernel: end_request: I/O error, dev sde, sector 863968
Feb 1 11:48:40 apwtest52 kernel: SCSI error : <2 0 1 1> return code = 0x20000
Feb 1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc, sector 935384
Feb 1 11:48:40 apwtest52 kernel: device-mapper: dm-multipath: Failing path 8:32. <================= after disabling/enabling the H2 host port 2
Feb 1 11:48:40 apwtest52 kernel: end_request: I/O error, dev sdc, sector 935392
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116924 <============All path to the device /dev/dm-0 failed
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116925
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116926
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116927
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116928
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116929
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116930
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116931
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116932
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116933
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116934
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116935
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116936
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116937
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116938
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116939
Feb 1 11:48:40 apwtest52 kernel: Buffer I/O error on device dm-0, logical block 116940
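The "Failing path major:minor" messages in the log above can be matched to the sd devices in the neighbouring end_request lines. The following is a minimal sketch, assuming the paths are on SCSI disk major 8 with 16 minors per disk (which is consistent with the log: 8:0 is sda, 8:96 is sdg). The helper names sd_name and failed_paths are hypothetical and exist only for this illustration.

```python
import re

SD_MAJOR = 8          # first SCSI disk major number on Linux
MINORS_PER_DISK = 16  # each sd disk owns 16 minors (whole disk + partitions)


def sd_name(major, minor):
    """Map a major:minor pair on SCSI disk major 8 to its sd device name."""
    if major != SD_MAJOR:
        raise ValueError("only SCSI disk major %d is handled here" % SD_MAJOR)
    index, partition = divmod(minor, MINORS_PER_DISK)
    if index >= 26:
        raise ValueError("disk index beyond sdz is not handled in this sketch")
    name = "sd" + chr(ord("a") + index)
    # Partition 0 means the whole disk, so no numeric suffix is added.
    return name + (str(partition) if partition else "")


def failed_paths(log_text):
    """Return the sd names of all paths dm-multipath reported as failing."""
    pairs = re.findall(r"Failing path (\d+):(\d+)", log_text)
    return [sd_name(int(major), int(minor)) for major, minor in pairs]


if __name__ == "__main__":
    sample = """\
kernel: device-mapper: dm-multipath: Failing path 8:0.
kernel: device-mapper: dm-multipath: Failing path 8:96.
kernel: device-mapper: dm-multipath: Failing path 8:64.
kernel: device-mapper: dm-multipath: Failing path 8:32.
"""
    print(failed_paths(sample))  # ['sda', 'sdg', 'sde', 'sdc']
```

Once all four paths (sda, sdg, sde, sdc) have been failed, dm-0 has no usable path left, which matches the point where the Buffer I/O errors on dm-0 begin.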
Observations :
As we fail ports on the other host, paths of the DM device are failed, and each subsequent disabling/enabling of a port (i.e. A2 or H2 ports) results in more path failures. This eventually leads to an all-paths-failed condition, which in turn results in an I/O error on RHEL4.0 U2/U3.
Through the device-mapper debug driver we find that there is no valid path in __choose_pgpath() and that m->current_pgpath (m is a pointer to struct multipath) is NULL when it reaches map_io() in dm-mpath.c.
Another observation is that we do not see any I/O errors when the same test is executed on SLES9 SP3/SP4.
The 0x20000 errors you are seeing correspond to the DID_BUS_BUSY error. This is considered one of the 'retryable' errors by the SCSI layer. As far as I know, the QLogic driver uses the DID_BUS_BUSY return code to force a retry of I/O for various fabric events. (I think they are planning to clean this up in their driver, or may already have, removing this hack and using the block/unblock interface instead. This could be double-checked by looking at the QLogic source code for the driver version you're using.) Since dm I/O uses a failfast flag, these retryable errors are not retried by the SCSI layer and are immediately propagated up to dm, which is probably why you're getting errors even on paths that should be okay. If this is the case, I would expect the same problem to occur on your SLES9 systems too if you're using the QLogic driver.
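To show how 0x20000 maps to DID_BUS_BUSY, here is a minimal sketch that splits a SCSI result word into its four bytes, assuming the usual Linux layout (status in bits 0-7, message in bits 8-15, host in bits 16-23, driver in bits 24-31). The function names are hypothetical, and the table lists only a subset of the host byte codes.

```python
# Subset of the Linux SCSI host byte codes (DID_*), keyed by value.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_result(result):
    """Split a 32-bit SCSI result word into its four component bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def host_byte_name(result):
    """Return the symbolic name of the host byte, or a hex fallback."""
    host = decode_result(result)["host"]
    return HOST_BYTE_NAMES.get(host, "unknown host byte 0x%02x" % host)


if __name__ == "__main__":
    print(decode_result(0x20000))   # host byte is 2, all other bytes are 0
    print(host_byte_name(0x20000))  # DID_BUS_BUSY
```

With the status, message and driver bytes all zero, the only information in the logged return code is the host byte, which is why the device itself reports no sense data for these failures.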
I assume you're not using queue_if_no_path? From a dm and user perspective, this is the only thing I can think of to work around this issue until the patch to propagate error codes up the stack is included and/or QLogic stops using DID_BUS_BUSY to force retries.
Thanks,
lan
Please provide some pointers on why we are seeing this behavior. Is this a known issue at this point in time?
Thanks and regards
-Murthy
--
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel