One small correction: in the cut and paste, the mpath timing was this. I had a cut-and-paste error in my prior message.

Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 active ready running
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 active ready running

Started again here, so it's the same 5 minutes while we are in the error_handler:

Fri Apr 29 17:21:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 failed faulty offline
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 failed faulty offline

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Laurence Oberman" <loberman@xxxxxxxxxx>
To: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx, "linux-scsi" <linux-scsi@xxxxxxxxxxxxxxx>, "Mike Snitzer" <snitzer@xxxxxxxxxx>, "James Bottomley" <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>, "device-mapper development" <dm-devel@xxxxxxxxxx>, lsf@xxxxxxxxxxxxxxxxxxxxxxxxxx, "Benjamin Marzinski" <bmarzins@xxxxxxxxxx>
Sent: Friday, April 29, 2016 5:47:07 PM
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

Hello Bart

I will email the entire log just to you; below is a summary of only the pertinent log messages. I don't think the whole list will have an interest in all the log messages. When I send the full log to you I will include the SCSI debug output for the error-handler activity.

I ran the tests. This is a worst-case test with 21 LUNs and jammed commands.
Typical failures, like a port or switch failure or a link down, won't look like this. Also, where we get RSCNs we can react more quickly, and we will. In this case I simulated a hung-switch issue, like a backplane/mesh problem, and believe me, I see a lot of these black-holed SCSI command situations in real life. Recovery with 21 LUNs that have in-flight commands to abort is ~300s.

This configuration is a multibus configuration for multipath. Two qla2xxx ports are connected to a switch, and the 2 array ports are connected to the same switch. This gives me 4 active/active paths for each of 21 mpath devices. I start I/O to all 21, reading 64k blocks using dd with iflag=direct.

Example mpath device:

mpathf (360014056a5be89021364a4c90556bfbb) dm-7 LIO-ORG ,block-14
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:13 sdp  8:240  active ready running
  |- 0:0:1:13 sdbf 67:144 active ready running
  |- 1:0:0:13 sdo  8:224  active ready running
  `- 1:0:1:13 sdbg 67:160 active ready running

eh_deadline is set to 10 on the 2 QLogic ports; eh_timeout is set to 10 for all devices. In multipath, fast_io_fail_tmo=5.

I jam one of the target array ports and discard the commands, effectively black-holing them, and leave it that way until we recover, and I watch the I/O. The recovery takes around 300s even with all the tuning, and this effectively lands up in Oracle cluster evictions.
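For anyone wanting to reproduce the setup, the tuning and workload above can be sketched as a short script. This is a sketch under assumptions, not the exact harness I used: host0/host1 as the two qla2xxx HBAs and /dev/mapper/mpath* as the 21 maps are placeholders, and the target-side jammer itself is not shown.

```shell
#!/bin/sh
# Sketch of the host-side tuning and workload described above.
# host0/host1 and the /dev/mapper/mpath* names are assumptions;
# paths are skipped when absent so this is safe to dry-run.

# Cap total SCSI error-handler time per HBA at 10 seconds.
for h in /sys/class/scsi_host/host0 /sys/class/scsi_host/host1; do
    [ -w "$h/eh_deadline" ] && echo 10 > "$h/eh_deadline"
done

# 10-second per-command abort timeout on every SCSI device.
for t in /sys/class/scsi_device/*/device/eh_timeout; do
    [ -w "$t" ] && echo 10 > "$t"
done

# One 64k direct-I/O reader per multipath device (21 in this test).
for m in /dev/mapper/mpath*; do
    [ -b "$m" ] && dd if="$m" of=/dev/null bs=64k iflag=direct &
done
wait
echo "tuning applied, readers finished"
```

fast_io_fail_tmo=5 is the one knob not set here; it comes from multipath.conf rather than sysfs.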
Watching multipath -ll on mpathe: it will block as expected while in recovery.

Blocked here:

Fri Apr 29 17:16:14 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 active ready running
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 active ready running

Started again here:

Fri Apr 29 17:16:26 EDT 2016
mpathe (360014052a6f5f9f256d4c1097eedfd95) dm-2 LIO-ORG ,block-13
size=20G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 0:0:0:12 sds  65:32  active ready running
  |- 0:0:1:12 sdbh 67:176 failed faulty offline
  |- 1:0:0:12 sdr  65:16  active ready running
  `- 1:0:1:12 sdbi 67:192 failed faulty offline

Tracking I/O:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy id wa st                 EDT
 0 21      0 15409656  25580 452056    0    0 13740     0  367 2523  0  1 41 59  0 2016-04-29 17:16:17
 0 21      0 15408904  25580 452336    0    0 15872     0  378 2779  0  1 42 57  0 2016-04-29 17:16:18
 2 20      0 15408096  25580 452624    0    0 17612     0  399 3310  0  0 26 73  0 2016-04-29 17:16:19
 0 21      0 15407188  25580 453096    0    0 17860     0  412 3137  0  0 30 70  0 2016-04-29 17:16:20
 0 21      0 15410420  25580 451552    0    0 23116     0  900 6747  0  1 31 69  0 2016-04-29 17:16:21
 0 21      0 15410552  25580 451420    0    0 22664     0  430 3752  0  0 24 76  0 2016-04-29 17:16:22
 0 21      0 15410552  25580 451420    0    0 15700     0  325 2619  0  0 25 75  0 2016-04-29 17:16:23
 0 21      0 15410552  25580 451420    0    0 13648     0  303 2387  0  0 28 71  0 2016-04-29 17:16:24
..
..
Blocked
..
Starts recovering ~300 seconds later
..
 0 38      0 15406428  25860 452652    0    0  3208     0  859 2437  0  1 13 86  0 2016-04-29 17:21:20
 0 38      0 15405668  26244 452268    0    0  6640     0  499 3575  0  1  0 99  0 2016-04-29 17:21:21
 0 38      0 15406840  26496 452300    0    0  5372     0  273 1878  0  0  1 98  0 2016-04-29 17:21:22
 0 38      0 15402684  29156 452048    0    0  9700     0  318 2326  0  0 11 88  0 2016-04-29 17:21:23
 0 38      0 15400800  30152 452168    0    0 11876     0  433 3265  0  1 16 83  0 2016-04-29 17:21:24
 0 38      0 15399792  31140 452344    0    0 11804     0  394 2902  0  1 15 85  0 2016-04-29 17:21:25
 0 38      0 15398552  31952 452196    0    0 12908     0  417 3347  0  1  4 96  0 2016-04-29 17:21:26
 0 35      0 15394564  32660 452800    0    0 10904     0  575 4191  1  1  9 89  0 2016-04-29 17:21:27
 0 29      0 15394292  32968 452900    0    0 13356     0  602 3993  1  1  1 96  0 2016-04-29 17:21:28
 0 26      0 15394464  33692 452196    0    0 16124     0  764 5451  1  1  2 96  0 2016-04-29 17:21:29
 0 24      0 15394168  33880 452392    0    0 20156     0  479 3957  0  1  3 96  0 2016-04-29 17:21:30
 0 24      0 15394216  34008 452460    0    0 21760     0  456 3836  0  1  6 94  0 2016-04-29 17:21:31
 0 22      0 15393920  34016 452604    0    0 20104    28  437 3418  0  1 12 87  0 2016-04-29 17:21:32
 0 22      0 15393952  34016 452600    0    0 20352     0  483 3259  0  1 67 32  0 2016-04-29 17:21:33
 0 22      0 15394148  34016 452600    0    0 20560     0  451 3228  0  1 74 25  0 2016-04-29 17:21:34

I see the error handler start in the QLogic driver. Keep in mind we are black-holed, so after the RESET we start the process again.

Apr 29 17:15:26 localhost root: Starting test with eh_deadline=10
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 -- 1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 -- 1 2002.
Apr 29 17:16:54 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 -- 1 2002.
Apr 29 17:16:55 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:20 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 -- 1 2002.
Apr 29 17:16:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 -- 1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:16 -- 1 2002.
Apr 29 17:16:57 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:14 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 -- 1 2002.
Apr 29 17:16:58 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 -- 1 2002.
Apr 29 17:17:00 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 -- 1 2002.
Apr 29 17:17:01 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:17:02 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 -- 1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:13 -- 1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:14 -- 1 2002.
Apr 29 17:17:03 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:16 -- 1 2002.
Apr 29 17:17:04 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 -- 1 2002.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-8018:1: ADAPTER RESET ISSUED nexus=1:1:12.
Apr 29 17:17:09 localhost kernel: qla2xxx [0000:07:00.1]-00af:1: Performing ISP error recovery - ha=ffff88042a4b0000.
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-500a:1: LOOP UP detected (4 Gbps).
Apr 29 17:17:10 localhost kernel: qla2xxx [0000:07:00.1]-8017:1: ADAPTER RESET SUCCEEDED nexus=1:1:12.
Apr 29 17:17:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:12 -- 1 2002.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-8018:0: ADAPTER RESET ISSUED nexus=0:1:17.
Apr 29 17:17:34 localhost kernel: qla2xxx [0000:07:00.0]-00af:0: Performing ISP error recovery - ha=ffff88042b030000.
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-500a:0: LOOP UP detected (4 Gbps).
Apr 29 17:17:35 localhost kernel: qla2xxx [0000:07:00.0]-8017:0: ADAPTER RESET SUCCEEDED nexus=0:1:17.
Apr 29 17:17:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:10 -- 1 2002.
Apr 29 17:17:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:1 -- 1 2002.
Apr 29 17:17:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:17 -- 1 2002.
Apr 29 17:18:00 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:0 -- 1 2002.
Apr 29 17:18:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:18 -- 1 2002.
Apr 29 17:18:10 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:13 -- 1 2002.
Apr 29 17:18:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:19 -- 1 2002.
Apr 29 17:18:20 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:8 -- 1 2002.
Apr 29 17:18:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:8 -- 1 2002.
Apr 29 17:18:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:7 -- 1 2002.
Apr 29 17:18:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:22 -- 1 2002.
Apr 29 17:18:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:22 -- 1 2002.
Apr 29 17:18:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:21 -- 1 2002.
Apr 29 17:18:50 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:21 -- 1 2002.
..
..
We start seeing the hung tasks:

Apr 29 17:19:16 localhost kernel: INFO: task dd:10193 blocked for more than 120 seconds.
Apr 29 17:19:16 localhost kernel: Not tainted 4.6.0-rc5+ #1
Apr 29 17:19:16 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 29 17:19:16 localhost kernel: dd              D ffff8804303079d8     0 10193  10177 0x00000080
Apr 29 17:19:16 localhost kernel:  ffff8804303079d8 ffff880434814140 ffff8800a86515c0 ffff880430308000
Apr 29 17:19:16 localhost kernel:  0000000000000000 7fffffffffffffff 0000000000000000 ffff8804307bfd00
Apr 29 17:19:16 localhost kernel:  ffff8804303079f0 ffffffff816ba8e5 ffff880436696e00 ffff880430307aa0
Apr 29 17:19:16 localhost kernel: Call Trace:
Apr 29 17:19:16 localhost kernel:  [<ffffffff816ba8e5>] schedule+0x35/0x80
Apr 29 17:19:16 localhost kernel:  [<ffffffff816bd661>] schedule_timeout+0x231/0x2d0
Apr 29 17:19:16 localhost kernel:  [<ffffffff81315843>] ? __blk_run_queue+0x33/0x40
Apr 29 17:19:16 localhost kernel:  [<ffffffff813158ba>] ? queue_unplugged+0x2a/0xb0
Apr 29 17:19:16 localhost kernel:  [<ffffffff816b9f66>] io_schedule_timeout+0xa6/0x110
Apr 29 17:19:16 localhost kernel:  [<ffffffff81259332>] do_blockdev_direct_IO+0x1b52/0x2180
Apr 29 17:19:16 localhost kernel:  [<ffffffff81254320>] ? I_BDEV+0x20/0x20
Apr 29 17:19:16 localhost kernel:  [<ffffffff812599a3>] __blockdev_direct_IO+0x43/0x50
Apr 29 17:19:16 localhost kernel:  [<ffffffff81254a7c>] blkdev_direct_IO+0x4c/0x50
Apr 29 17:19:16 localhost kernel:  [<ffffffff81193ab1>] generic_file_read_iter+0x641/0x7b0
Apr 29 17:19:16 localhost kernel:  [<ffffffff8120bcf5>] ? mem_cgroup_commit_charge+0x85/0x100
Apr 29 17:19:16 localhost kernel:  [<ffffffff81254e57>] blkdev_read_iter+0x37/0x40
Apr 29 17:19:16 localhost kernel:  [<ffffffff81219379>] __vfs_read+0xc9/0x100
Apr 29 17:19:16 localhost kernel:  [<ffffffff8121a1ef>] vfs_read+0x7f/0x130
Apr 29 17:19:16 localhost kernel:  [<ffffffff8121b6d5>] SyS_read+0x55/0xc0
Apr 29 17:19:16 localhost kernel:  [<ffffffff81003c12>] do_syscall_64+0x62/0x110
Apr 29 17:19:16 localhost kernel:  [<ffffffff816be4a1>] entry_SYSCALL64_slow_path+0x25/0x25
..
..
Finally, after the serialized timeouts, we get the offline states:

Apr 29 17:19:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:3 -- 1 2002.
Apr 29 17:19:30 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:17 -- 1 2002.
Apr 29 17:19:36 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:7 -- 1 2002.
Apr 29 17:19:40 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:18 -- 1 2002.
Apr 29 17:19:46 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:4 -- 1 2002.
Apr 29 17:19:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:19 -- 1 2002.
Apr 29 17:19:56 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:6 -- 1 2002.
Apr 29 17:20:01 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:2 -- 1 2002.
Apr 29 17:20:06 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:9 -- 1 2002.
Apr 29 17:20:11 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:23 -- 1 2002.
Apr 29 17:20:16 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:12 -- 1 2002.
Apr 29 17:20:21 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:3 -- 1 2002.
Apr 29 17:20:26 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:10 -- 1 2002.
Apr 29 17:20:31 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:4 -- 1 2002.
Apr 29 17:20:37 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:0 -- 1 2002.
Apr 29 17:20:41 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:6 -- 1 2002.
Apr 29 17:20:47 localhost kernel: qla2xxx [0000:07:00.0]-801c:0: Abort command issued nexus=0:1:1 -- 1 2002.
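The cadence above (each host retiring roughly one abort every 10s) suggests a back-of-envelope model of where the ~300s goes. This is my rough model, not a formula from the driver or the mid-layer; the reset_overhead figure is an assumption:

```shell
#!/bin/sh
# Serialized-recovery estimate: per host, one abort completes about
# every eh_timeout seconds, serialized across the 21 jammed LUNs on
# the failed path, plus the ISP resets and retries in between.
luns=21
eh_timeout=10      # seconds per serialized abort (as tuned above)
reset_overhead=60  # rough allowance for adapter resets (assumption)
total=$(( luns * eh_timeout + reset_overhead ))
echo "worst-case serialized recovery: ~${total}s"
```

That lands in the same order of magnitude as the ~300s observed, which is the point: the time scales linearly with LUN count because the aborts are serialized.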
Apr 29 17:20:51 localhost kernel: qla2xxx [0000:07:00.1]-801c:1: Abort command issued nexus=1:1:9 -- 1 2002.
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:10: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:1: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 66:208.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 68:192.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:224.
Apr 29 17:20:51 localhost kernel: device-mapper: multipath: Failing path 67:192.
Apr 29 17:20:51 localhost multipathd: mpatha: sdat - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 66:208 in map mpatha
Apr 29 17:20:51 localhost multipathd: mpatha: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathb: sdby - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 68:192 in map mpathb
Apr 29 17:20:51 localhost multipathd: mpathb: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathc: sdbk - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:224 in map mpathc
Apr 29 17:20:51 localhost multipathd: mpathc: remaining active paths: 3
Apr 29 17:20:51 localhost multipathd: mpathe: sdbi - path offline
Apr 29 17:20:51 localhost multipathd: checker failed path 67:192 in map mpathe
Apr 29 17:20:51 localhost multipathd: mpathe: remaining active paths: 3
Apr 29 17:20:51 localhost kernel: sd 1:0:1:8: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:7: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:22: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:21: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:20: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:14: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:16: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:17: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:18: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:19: Device offlined - not ready after error recovery
Apr 29 17:20:51 localhost kernel: sd 1:0:1:2: Device offlined - not ready after error recovery
..
..
Apr 29 17:20:51 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: rejecting I/O to offline device
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] killing request
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:12: [sdbi] CDB: Read(10) 28 00 00 02 5b 80 00 00 80 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbi, sector 154496
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:10: [sdbk] tag#13 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdbk, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:1: [sdby] tag#16 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost multipathd: mpathf: sdbg - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:160 in map mpathf
Apr 29 17:20:52 localhost multipathd: mpathf: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathg: sdbe - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:128 in map mpathg
Apr 29 17:20:52 localhost multipathd: mpathg: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathi: sdbc - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:96 in map mpathi
Apr 29 17:20:52 localhost multipathd: mpathi: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathj: sdba - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:64 in map mpathj
Apr 29 17:20:52 localhost multipathd: mpathj: remaining active paths: 3
Apr 29 17:20:52 localhost multipathd: mpathk: sday - path offline
Apr 29 17:20:52 localhost multipathd: checker failed path 67:32 in map mpathk
Apr 29 17:20:52 localhost multipathd: mpathk: remaining active paths: 3
..
..
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 67:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 66:240.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:160.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:0.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 69:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:128.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:96.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:64.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:224.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:32.
Apr 29 17:20:52 localhost kernel: device-mapper: multipath: Failing path 68:0.
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdby, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:0: [sdat] tag#15 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:52 localhost kernel: blk_update_request: I/O error, dev sdat, sector 0
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:52 localhost kernel: sd 1:0:1:13: [sdbg] tag#17 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: rejecting I/O to offline device
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] killing request
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:13: [sdbg] CDB: Read(10) 28 00 00 02 5d 80 00 00 80 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbg, sector 155008
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:8: [sdbo] tag#31 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdbo, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:20:53 localhost kernel: sd 1:0:1:7: [sdca] tag#30 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:20:53 localhost kernel: blk_update_request: I/O error, dev sdca, sector 0
Apr 29 17:20:53 localhost kernel: sd 1:0:1:22: [sdcg] tag#26 FAILED Result: hostbyte=DID_RESET
..
Apr 29 17:21:18 localhost multipathd: checker failed path 66:224 in map mpatha
Apr 29 17:21:18 localhost multipathd: mpatha: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathb: sdbx - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:176 in map mpathb
Apr 29 17:21:18 localhost multipathd: mpathb: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathc: sdbj - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:208 in map mpathc
Apr 29 17:21:18 localhost multipathd: mpathc: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathe: sdbh - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:176 in map mpathe
Apr 29 17:21:18 localhost multipathd: mpathe: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathf: sdbf - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:144 in map mpathf
Apr 29 17:21:18 localhost multipathd: mpathf: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathg: sdbd - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:112 in map mpathg
Apr 29 17:21:18 localhost multipathd: mpathg: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathi: sdbb - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:80 in map mpathi
Apr 29 17:21:18 localhost multipathd: mpathi: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatho: sdbr - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:80 in map mpatho
Apr 29 17:21:18 localhost multipathd: mpatho: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathq: sdbp - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:48 in map mpathq
Apr 29 17:21:18 localhost multipathd: mpathq: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpathv: sdbz - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 68:208 in map mpathv
Apr 29 17:21:18 localhost multipathd: mpathv: remaining active paths: 2
Apr 29 17:21:18 localhost multipathd: mpatht: sdbl - path offline
Apr 29 17:21:18 localhost multipathd: checker failed path 67:240 in map mpatht
Apr 29 17:21:18 localhost multipathd: mpatht: remaining active paths: 2
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 66:224.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:176.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:144.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:112.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:80.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:48.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 68:208.
Apr 29 17:21:18 localhost kernel: device-mapper: multipath: Failing path 67:240.
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdaw, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] tag#25 CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
Apr 29 17:21:18 localhost kernel: blk_update_request: I/O error, dev sdbn, sector 0
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: rejecting I/O to offline device
Apr 29 17:21:18 localhost kernel: sd 0:0:1:8: [sdbn] killing request

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Laurence Oberman" <loberman@xxxxxxxxxx>
To: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx, "linux-scsi" <linux-scsi@xxxxxxxxxxxxxxx>, "Mike Snitzer" <snitzer@xxxxxxxxxx>, "James Bottomley" <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>, "device-mapper development" <dm-devel@xxxxxxxxxx>, lsf@xxxxxxxxxxxxxxxxxxxxxxxxxx
Sent: Thursday, April 28, 2016 12:47:24 PM
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

Hello Bart,

This is the case where a subset of the paths fails. As you know, the remaining paths won't be used until the eh_handler is either done or is short-circuited. What I will do is set this up via my jammer and capture a test using the latest upstream kernel. Of course my customer pain points are all in the RHEL kernels, so I need to capture a recovery trace on the latest upstream kernel.

When the SCSI commands for a path are black-holed and remain that way, even with eh_deadline and the short-circuited adapter resets we simply try again and get back into the wait loop until we finally declare the device offline. This can take a while, and it differs depending on QLogic, Emulex, fnic, etc.

First thing tomorrow I will set this up and show you what I mean.

Thanks!!
Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services

----- Original Message -----
From: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>
To: "Laurence Oberman" <loberman@xxxxxxxxxx>
Cc: linux-block@xxxxxxxxxxxxxxx, "linux-scsi" <linux-scsi@xxxxxxxxxxxxxxx>, "Mike Snitzer" <snitzer@xxxxxxxxxx>, "James Bottomley" <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>, "device-mapper development" <dm-devel@xxxxxxxxxx>, lsf@xxxxxxxxxxxxxxxxxxxxxxxxxx
Sent: Thursday, April 28, 2016 12:41:26 PM
Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/28/2016 09:23 AM, Laurence Oberman wrote:
> We still suffer from periodic complaints in our large customer base
> regarding the long recovery times for dm-multipath. Most of the time
> this is when we have something like a switch back-plane issue or an
> issue where RSCNs are blocked coming back up the fabric. Corner cases
> still bite us often.
>
> Most of the complaints originate from customers, for example, seeing
> Oracle cluster evictions where, during the waiting on the mid-layer,
> all mpath I/O is blocked until recovery.
>
> We have to tune eh_deadline, eh_timeout and fast_io_fail_tmo, but even
> tuning those we have to wait on serial recovery even if we set the
> timeouts low.
>
> Lately we have been living with
> eh_deadline=10
> eh_timeout=5
> fast_io_fail_tmo=10
> leaving the default sd timeout at 30s.
>
> So this continues to be an issue, and I have specific examples using
> the jammer I can provide showing the serial recovery times here.

Hello Laurence,

The long recovery times you refer to, is that for a scenario where all paths failed or for a scenario where some paths failed and other paths are still working? In the latter case, how long does it take before dm-multipath fails over to another path?

Thanks,

Bart.
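For reference, of the tunables quoted above, eh_deadline and eh_timeout live under sysfs, while fast_io_fail_tmo is set in /etc/multipath.conf. A minimal fragment matching the quoted value (everything else left at its default; per-array device sections omitted):

```
defaults {
    fast_io_fail_tmo 10
}
```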
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel