On 04/10/2017 10:17 PM, Mauricio Faria de Oliveira wrote:
For documentation purposes, I'll reply to this cover letter with the analysis
of such cases of this problem, and the accompanying messages from kernel logs.
Here it goes, for anyone interested.
Scenario: 4 LUNs, 2 target port groups (PGs), 2 paths per PG.
mpatha (360050764008100dac000000000000100) dm-0 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:1:0 sdf 8:80 active undef running
| `- 2:0:1:0 sdn 8:208 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 1:0:0:0 sdb 8:16 active undef running
`- 2:0:0:0 sdj 8:144 active undef running
This LUN's active port group becomes Unavailable.
The ongoing I/O requests sense it via SCSI command errors.
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: [sdf] tag#0 Add.
Sense: Logical unit not accessible, target port in unavailable state
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:1:0: [sdn] tag#0 Add.
Sense: Logical unit not accessible, target port in unavailable state
Accordingly, dm-mpath fails both paths:
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:80.
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:208.
The SCSI command error has gone through alua_check_sense().
It scheduled an alua_rtpg(), which confirms its port group
(for 1:0:1:0) and the other (for 2:0:0:0) are unavailable.
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01
state U non-preferred
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00
state U non-preferred supports tolusna
Failing over, dm-mpath switched path groups, to the other/enabled port
group.
Notice that further SCSI command errors _happen_, but are not _reported_,
as alua_rtpg() has set pg->state = unavailable before more I/O requests
eventually reached alua_prep_fn(), which sets them with RQF_QUIET.
Accordingly, dm-mpath fails both paths:
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:144.
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:16.
The path checkers are running, and constantly receiving senses of
unavailable state
(seen through debug messages), and alua_rtpg() rechecks are scheduled,
but no
repeated messages for 'state U' are printed, as it has not changed at all.
After a few minutes, both port groups become active
(optimized/non-optimized),
and dm-mpath reinstate all paths:
Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:16.
Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:80.
Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:144.
Apr 6 07:21:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:208.
The dm-mpath reinstate calls alua_activate() -> alua_rtpg() for both
paths in the active port group.
The first alua_rtpg() worker is still running when the second
alua_rtpg() is scheduled,
so pg->rtpg_sdev does not change, and it's reported as 1:0:1:0 twice
(but called for 1:0:1:0 and 2:0:1:0).
Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01
state N non-preferred
Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00
state A non-preferred supports tolusna
Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 2:0:0:0: alua: port group 01
state N non-preferred
Apr 6 07:21:01 ltcalpine-lp2 kernel: sd 1:0:1:0: alua: port group 00
state A non-preferred supports tolusna
...
mpathb (360050764008100dac000000000000101) dm-1 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:0:1 sdc 8:32 active undef running
| `- 2:0:0:1 sdk 8:160 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 1:0:1:1 sdg 8:96 active undef running
`- 2:0:1:1 sdo 8:224 active undef running
Similarly,
Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: [sdk] tag#2 Add.
Sense: Logical unit not accessible, target port in unavailable state
Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 1:0:0:1: [sdc] tag#5 Add.
Sense: Logical unit not accessible, target port in unavailable state
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:160.
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:32.
In this case alua_rtpg() does not print messages for the 'other' port group
(without 'supports'), because the other PG worker was running at the time.
There was an alua_rtpg() scheduled due to alua_check_sense() for 2:0:0:1,
and another for alua_activate() of 1:0:1:1 (from the 'enabled' path group).
Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:1: alua: port group 00
state U non-preferred supports tolusna
Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 1:0:1:1: alua: port group 01
state U non-preferred supports tolusna
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:96.
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:224.
After a few minutes,
Apr 5 20:53:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:96.
Apr 5 20:53:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:224.
Apr 5 20:53:01 ltcalpine-lp2 kernel: sd 2:0:0:1: alua: port group 00
state A non-preferred
Apr 5 20:53:01 ltcalpine-lp2 kernel: sd 1:0:1:1: alua: port group 01
state N non-preferred supports tolusna
Apr 5 20:53:01 ltcalpine-lp2 kernel: sd 2:0:0:1: alua: port group 00
state A non-preferred
Apr 5 20:53:01 ltcalpine-lp2 kernel: sd 1:0:1:1: alua: port group 01
state N non-preferred supports tolusna
Apr 5 20:53:06 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:32.
Apr 5 20:53:06 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:160.
...
mpathc (360050764008100dac000000000000141) dm-2 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:0:2 sdd 8:48 active undef running
| `- 2:0:0:2 sdl 8:176 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 1:0:1:2 sdh 8:112 active undef running
`- 2:0:1:2 sdp 8:240 active undef running
With this one, no SCSI command error messages because
the path checkers noticed the unavailable state first.
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:176.
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:48.
In this case alua_rtpg() does not print messages for the 'other' port group
(same reason as described in previous case). Both port groups are
unavailable.
Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 2:0:0:2: alua: port group 00
state U non-preferred supports tolusna
Apr 5 19:59:52 ltcalpine-lp2 kernel: sd 1:0:1:2: alua: port group 01
state U non-preferred supports tolusna
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:240.
Apr 5 19:59:52 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:112.
After a few minutes, they return to active (opt/non-opt) state.
Apr 5 20:53:01 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:112.
Apr 5 20:53:01 ltcalpine-lp2 kernel: sd 2:0:0:2: alua: port group 00
state A non-preferred
Apr 5 20:53:01 ltcalpine-lp2 kernel: sd 1:0:1:2: alua: port group 01
state N non-preferred supports tolusna
Apr 5 20:53:04 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:176.
Apr 5 20:53:04 ltcalpine-lp2 kernel: sd 2:0:1:2: alua: port group 01
state N non-preferred
Apr 5 20:53:04 ltcalpine-lp2 kernel: sd 2:0:0:2: alua: port group 00
state A non-preferred supports tolusna
Apr 5 20:53:05 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:48.
Apr 5 20:53:05 ltcalpine-lp2 kernel: sd 2:0:1:2: alua: port group 01
state N non-preferred
Apr 5 20:53:05 ltcalpine-lp2 kernel: sd 1:0:0:2: alua: port group 00
state A non-preferred supports tolusna
Apr 5 20:53:06 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:240.
...
mpathd (360050764008100dac000000000000142) dm-3 IBM,2145
size=40G features='2 queue_if_no_path retain_attached_hw_handler'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 1:0:1:3 sdi 8:128 active undef running
| `- 2:0:1:3 sdq 65:0 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 1:0:0:3 sde 8:64 active undef running
`- 2:0:0:3 sdm 8:192 active undef running
Similarly to the cases above, with some duplicated messages for the RTPG
information
because alua_activate() schedules both ALUA_PG_RUN_[RS]TPG, so
alua_rtpg_work() first
runs ALUA_PG_RUN_RTPG / alua_rtpg() which prints the message once, and
then goes into
ALUA_PG_RUN_STPG / alua_stpg(), but as this storage system only supports
implict TGPS,
it falls into SCSI_DH_RETRY, which schedules another
ALUA_PG_RUN_RTPG/alua_rtpg() and
it prints the message again.
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:1:3: [sdq] tag#5 Add.
Sense: Logical unit not accessible, target port in unavailable state
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 65:0.
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:3: [sdi] tag#8 Add.
Sense: Logical unit not accessible, target port in unavailable state
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:128.
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 2:0:0:3: alua: port group 01
state U non-preferred
Apr 6 07:14:56 ltcalpine-lp2 kernel: sd 1:0:1:3: alua: port group 00
state U non-preferred supports tolusna
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:192.
Apr 6 07:14:56 ltcalpine-lp2 kernel: device-mapper: multipath: Failing
path 8:64.
Apr 6 07:47:25 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:192.
Apr 6 07:47:25 ltcalpine-lp2 kernel: sd 2:0:1:3: alua: port group 00
state A non-preferred
Apr 6 07:47:25 ltcalpine-lp2 kernel: sd 2:0:0:3: alua: port group 01
state N non-preferred supports tolusna
Apr 6 07:47:25 ltcalpine-lp2 kernel: sd 2:0:1:3: alua: port group 00
state A non-preferred
Apr 6 07:47:25 ltcalpine-lp2 kernel: sd 2:0:0:3: alua: port group 01
state N non-preferred supports tolusna
Apr 6 07:47:26 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:64.
Apr 6 07:47:26 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 8:128.
Apr 6 07:47:26 ltcalpine-lp2 kernel: device-mapper: multipath:
Reinstating path 65:0.
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 2:0:1:3: alua: port group 00
state A non-preferred
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 1:0:0:3: alua: port group 01
state N non-preferred supports tolusna
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 2:0:1:3: alua: port group 00
state A non-preferred
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 1:0:0:3: alua: port group 01
state N non-preferred supports tolusna
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 2:0:0:3: alua: port group 01
state N non-preferred
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 1:0:1:3: alua: port group 00
state A non-preferred supports tolusna
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 2:0:0:3: alua: port group 01
state N non-preferred
Apr 6 07:47:26 ltcalpine-lp2 kernel: sd 1:0:1:3: alua: port group 00
state A non-preferred supports tolusna
--
Mauricio Faria de Oliveira
IBM Linux Technology Center