https://bugzilla.kernel.org/show_bug.cgi?id=93951

            Bug ID: 93951
           Summary: Multipath hangs if Active Alua path stops responding
                    (timeouts)
           Product: IO/Storage
           Version: 2.5
    Kernel Version: 3.10.0-123.el7.x86_64
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: SCSI
          Assignee: linux-scsi@xxxxxxxxxxxxxxx
          Reporter: probeless@xxxxxxx
        Regression: No

We are observing multipath hanging when the active path of an ALUA SAS LUN
(one Active/Optimized path, one Standby path) stops responding to SCSI
commands (constant timeouts).

To summarize our setup: the storage device is connected via SAS and supports
the ALUA Active/Optimized and Standby states. A typical user configuration has
2 paths to each LUN, one in the Active/Optimized state and the other in the
Standby state.

Below are our current multipath.conf settings that were used to collect the
latest set of logs. Note that some of these settings were added in an attempt
to work around the issue but did not seem to have any effect.

devices {
        device {
                vendor                  XXXXX
                product                 "XXXXX Vol"
                path_checker            tur
                prio                    alua
                path_grouping_policy    group_by_prio
                features                "2 pg_init_retries 50"
                hardware_handler        "1 alua"
                failback                immediate
                rr_weight               priorities
                no_path_retry           5
                dev_loss_tmo            60
                path_selector           "service-time 0"
        }
}

When everything is operating nominally, "multipath -ll" shows the expected 2
paths to each LUN: the Active/Optimized path has a priority of 50 and the
Standby path has a priority of 1. Example:

mpathi (36000d31001108300000000000000003b) dm-9 XXXXX,XXXXX Vol
size=500G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 1:0:1:7 sdr 65:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 1:0:0:7 sdh 8:112 active ready running

/var/log/messages shows the path checker running every ~5 seconds, as
expected:

Feb 24 13:18:22 localhost multipathd: sdr: path state = running
Feb 24 13:18:22 localhost multipathd: sdr: get_state
Feb 24 13:18:22 localhost multipathd: 65:16: tur checker starting up
Feb 24 13:18:22 localhost multipathd: 65:16: tur checker finished, state up
Feb 24 13:18:22 localhost multipathd: sdr: state = up
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble map [3 queue_if_no_path pg_init_retries 50 1 alua 2 1 service-time 0 1 2 65:16 50 1 service-time 0 1 2 8:112 1 1 ]
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble status [2 0 1 0 2 1 A 0 1 2 65:16 A 0 0 1 E 0 1 2 8:112 A 0 0 1 ]
Feb 24 13:18:22 localhost multipathd: sdr: mask = 0x8
Feb 24 13:18:22 localhost multipathd: sdr: path state = running
Feb 24 13:18:22 localhost multipathd: reported target port group is 61447
Feb 24 13:18:22 localhost multipathd: aas = 00 [active/optimized]
Feb 24 13:18:22 localhost multipathd: sdr: alua prio = 50
....
Feb 24 13:18:22 localhost multipathd: sdh: path state = running
Feb 24 13:18:22 localhost multipathd: sdh: get_state
Feb 24 13:18:22 localhost multipathd: 8:112: tur checker starting up
Feb 24 13:18:22 localhost multipathd: 8:112: tur checker finished, state up
Feb 24 13:18:22 localhost multipathd: sdh: state = up
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble map [3 queue_if_no_path pg_init_retries 50 1 alua 2 1 service-time 0 1 2 65:16 50 1 service-time 0 1 2 8:112 1 1 ]
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble status [2 0 1 0 2 1 A 0 1 2 65:16 A 0 0 1 E 0 1 2 8:112 A 0 0 1 ]
Feb 24 13:18:22 localhost multipathd: sdh: mask = 0x8
Feb 24 13:18:22 localhost multipathd: sdh: path state = running
Feb 24 13:18:22 localhost multipathd: reported target port group is 61461
Feb 24 13:18:22 localhost multipathd: aas = 02 [standby]
Feb 24 13:18:22 localhost multipathd: sdh: alua prio = 1
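(For reference, the state that the tur checker and the alua prioritizer act on
in these logs, i.e. the TEST UNIT READY result and the ALUA asymmetric access
state, can also be spot-checked by hand. The commands below are only a sketch,
assuming sg3_utils is installed; the device name is taken from the example
above.)

    # TEST UNIT READY against the Active/Optimized path; this is the same
    # SCSI command the tur path checker issues (exit status 0 = ready).
    sg_turs -v /dev/sdr

    # REPORT TARGET PORT GROUPS, decoded; this is the data the alua
    # prioritizer maps to the priorities shown above (active/optimized -> 50,
    # standby -> 1).
    sg_rtpg --decode /dev/sdr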
Now, if the Active/Optimized path suddenly stops responding to I/O, causing
timeouts indefinitely, multipath hangs, including the path checker. The hang
is observed as all multipath and multipathd commands hanging, and there are no
longer any path checker log messages. The messages below occur over and over
again, showing the active path not responding to I/O.

Feb 24 13:39:42 localhost kernel: scsi target1:0:1: enclosure_logical_id(0x5b083fe0ead0a900), slot(4)
Feb 24 13:39:46 localhost kernel: sd 1:0:1:7: task abort: SUCCESS scmd(ffff8804661f4540)
Feb 24 13:39:46 localhost kernel: sd 1:0:1:7: attempting task abort! scmd(ffff8804661f7480)
Feb 24 13:39:46 localhost kernel: sd 1:0:1:7: [sdr] CDB:
Feb 24 13:39:46 localhost kernel: Write(10): 2a 00 00 01 e1 c0 00 04 00 00
Feb 24 13:39:46 localhost kernel: scsi target1:0:1: handle(0x000a), sas_address(0x5000d31001108307), phy(4)
Feb 24 13:39:46 localhost kernel: scsi target1:0:1: enclosure_logical_id(0x5b083fe0ead0a900), slot(4)

The hang appears to persist indefinitely until the devices are offlined
through some other means, at which point multipath resumes execution.

Our expectation is that the Active/Optimized path would eventually be failed
and I/O would then be attempted down the Standby path. Alternatively, we would
expect the path checker to keep running; in that case the TUR to the
Active/Optimized path would fail, and the Standby path, having transitioned to
Active/Optimized, would get priority 50. Is this a false expectation, and/or
why does multipath hang during this time?
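(As an illustration of "offlined through some other means": the sketch below
shows one way a stuck path can be forced offline from userspace. This is only
a sketch using the generic SCSI sysfs attributes, not necessarily the exact
procedure used here; sdr is the non-responding Active/Optimized path from the
logs above.)

    # Force the non-responding path device offline; new I/O submitted to it
    # fails immediately instead of waiting on the unresponsive target. As
    # noted above, offlining the device is what currently un-wedges multipath.
    echo offline > /sys/block/sdr/device/state

    # Per-command SCSI timeout, in seconds (default is typically 30); lowering
    # it makes command timeouts on a misbehaving path surface sooner.
    cat /sys/block/sdr/device/timeout
    echo 10 > /sys/block/sdr/device/timeout

    # Bring the path back once the target is responding again.
    echo running > /sys/block/sdr/device/state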