Re: ALUA for HA failover and multipathd

On 06/10/2014 07:22 PM, Nicholas A. Bellinger wrote:
On Tue, 2014-06-10 at 14:56 +0200, Hannes Reinecke wrote:
On 06/10/2014 11:11 AM, Chris Boot wrote:
On 06/06/14 07:34, Nicholas A. Bellinger wrote:
Hi Chris & Phillip,

On Thu, 2014-06-05 at 15:31 +0100, Chris Boot wrote:
Hi folks,

Background: I'm working on creating a Highly Available iSCSI system
using Pacemaker with some collaborators. We looked at the existing
iSCSILogicalUnit and iSCSITarget resource scripts, and they didn't seem
to do quite what we wanted, so we have started down the route of writing
our own. Our new scripts are GPL-licensed and their current incarnations are
available at https://github.com/tigercomputing/ocf-lio

In general terms the setup is reasonably simple: we have a DRBD volume
running in dual-primary mode, which is then used to create an iblock
device, which itself is exported over iSCSI. We have been attempting to
use implicit-only ALUA multipathing to manage target failover.

We create two ALUA TPGs on each node, call them east/west, and mark one
as Active/Optimised and the other as Active/NonOptimised. When we create
the iSCSI TPGs on both nodes, one node's TPG is placed in the west ALUA
TPG and the other node's is placed into the east ALUA TPG.
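
Roughly, the configfs side of that looks like the sketch below (the
backstore name, group IDs and IQN are illustrative; the real commands
live in the resource agents linked above):

  # On the "west" node; the other node mirrors this and attaches its
  # LUN to the east group instead.
  CORE=/sys/kernel/config/target/core/iblock_0/disk1
  mkdir $CORE/alua/east $CORE/alua/west
  echo 1 > $CORE/alua/east/tg_pt_gp_id
  echo 2 > $CORE/alua/west/tg_pt_gp_id
  echo 1 > $CORE/alua/east/alua_access_type     # 1 = implicit-only ALUA
  echo 1 > $CORE/alua/west/alua_access_type
  echo 0 > $CORE/alua/west/alua_access_state    # 0 = Active/Optimised
  echo 1 > $CORE/alua/east/alua_access_state    # 1 = Active/NonOptimised
  echo 100 > $CORE/alua/east/nonop_delay_msecs
  # Attach this node's iSCSI TPG LUN to its ALUA group:
  echo west > /sys/kernel/config/target/iscsi/iqn.2014-06.example:ha/tpgt_1/lun/lun_0/alua_tg_pt_gp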

When simulating a fail-over, the ALUA states on both east and west are
changed on both nodes and kept synchronised.
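
In configfs terms the switch is just a pair of writes, mirrored on both
nodes (same paths as in the sketch above):

  echo 1 > $CORE/alua/west/alua_access_state    # west -> Active/NonOptimised
  echo 0 > $CORE/alua/east/alua_access_state    # east -> Active/Optimised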

When using multipathd on Linux as the initiator, everything appears to
work well until we switch roles on the target. multipathd seems to
stick to the old path, even though it is now NonOptimised and running
slowly due to the 100ms nonop_delay_msecs.
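
For reference, the sort of multipath.conf device stanza in play on the
initiator is along these lines (a sketch; the vendor/product match and
exact settings are illustrative):

  devices {
          device {
                  vendor                 "LIO-ORG"
                  product                ".*"
                  path_grouping_policy   group_by_prio
                  prio                   alua
                  hardware_handler       "1 alua"
                  path_checker           tur
                  failback               immediate
          }
  }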

If, instead, we set the standby path to Standby rather than
Active/NonOptimised, multipathd correctly notices that the path is
unavailable and sends IO over the Active/Optimised path. However, if the
initiator originally logs in to the target while the path is in Standby,
it fails to probe the device correctly. When the path becomes
Active/Optimised during failover, multipathd is unable to use it and
fails the path. The TUR checker then reports the path as up and
reinstates it, only for it to be failed again, and so on. The only way
to bring it back to life is to "echo 1 >
/sys/block/$DEV/device/rescan" and re-run multipath by hand.
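
That is, something like the following ("multipath -r" here is just one
way to force a map reload):

  echo 1 > /sys/block/$DEV/device/rescan
  multipath -r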

I haven't been able to test this myself, but Philip (CCed) reports that
similar behaviour is seen using VMware as the initiator rather than Linux.

Has anyone managed to set up an ALUA multipath HA SAN with two nodes and
LIO? What are we missing? Am I going to have to throw in the towel on
ALUA and just use virtual IP failover instead?

After testing this evening with a similar config on a single target
instance, the issue where the initial LUN probe fails on an ALUA group
implicitly set to the Standby state is reproducible.

The failure occurs during the initial READ_CAPACITY, which is currently
disallowed by the opcode checking in core_alua_state_standby().  I
thought at one point that READ_CAPACITY could fail during the initial
LUN probe and still bring up a struct scsi_device with a zero number of
sectors, but I could be wrong..? (Hannes CC'ed)

In any event, the following patch to permit READ_CAPACITY addresses the
initial LUN probe failure and works on my end, and should allow implicit
ALUA Active/* <-> Standby state changes to function now.

Please confirm with your setup.

Hi Nab,

Ack, this fixes the issue completely for me under Linux with multipathd.
The standby path is now probed correctly when you log in to the target,
and when you fail over to it everything carries on. Thanks very much!

Note that I tested on 3.14, so I had to replace set_ascq() with
*alua_ascq as you did for the stable patches.

Tested-by: Chris Boot <crb@xxxxxxxxxxxxxxxxxxxxx>

FWIW, the kernel messages we obtain when probing the disk look like:

[  388.929254] scsi12 : iSCSI Initiator over TCP/IP
[  389.184537] scsi 12:0:0:0: Direct-Access     LIO-ORG  test1            4.0  PQ: 0 ANSI: 5
[  389.184632] scsi 12:0:0:0: alua: supports implicit TPGS
[  389.185229] scsi 12:0:0:0: alua: port group 11 rel port 01
[  389.185390] scsi 12:0:0:0: alua: port group 11 state S non-preferred supports TOlUSNA
[  389.185393] scsi 12:0:0:0: alua: Attached
[  389.185791] sd 12:0:0:0: Attached scsi generic sg5 type 0
[  389.186499] sd 12:0:0:0: [sde] 2147418040 512-byte logical blocks: (1.09 TB/1023 GiB)
[  389.187254] sd 12:0:0:0: [sde] Write Protect is off
[  389.187258] sd 12:0:0:0: [sde] Mode Sense: 43 00 10 08
[  389.188301] sd 12:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[  389.190677] ldm_validate_partition_table(): Disk read failed.
[  389.190705] Dev sde: unable to read RDB block 0
[  389.190734]  sde: unable to read partition table
[  389.192784] sd 12:0:0:0: [sde] Attached SCSI disk
[  389.246318] sd 10:0:0:0: alua: port group 10 state A preferred supports TOlUSNA
[  389.325327] sd 10:0:0:0: alua: port group 10 state A preferred supports TOlUSNA

Thanks for getting a patch to us so quickly, and sorry it took so long
to get it tested.

Thanks!

--nab

diff --git a/drivers/target/target_core_alua.c b/drivers/target/target_core_alua.c
index fcbe612..63512cc 100644
--- a/drivers/target/target_core_alua.c
+++ b/drivers/target/target_core_alua.c
@@ -576,7 +576,16 @@ static inline int core_alua_state_standby(
   	case REPORT_LUNS:
   	case RECEIVE_DIAGNOSTIC:
   	case SEND_DIAGNOSTIC:
+	case READ_CAPACITY:
   		return 0;
+	case SERVICE_ACTION_IN:
+		switch (cdb[1] & 0x1f) {
+		case SAI_READ_CAPACITY_16:
+			return 0;
+		default:
+			set_ascq(cmd, ASCQ_04H_ALUA_TG_PT_STANDBY);
+			return 1;
+		}
   	case MAINTENANCE_IN:
   		switch (cdb[1] & 0x1f) {
   		case MI_REPORT_TARGET_PGS:


Hmm. While I agree with the patch (and can confirm that it's required to
get multipath working on LIO-target), it really seems that multipath is
making incorrect assumptions here.
Looking at the spec, READ_CAPACITY is indeed not required to be supported
for STANDBY paths, so multipath will fail for ALUA implementations that
follow the spec more closely.

Guess we need to discuss this on dm-devel ...


<nod>, at least for Standby state, the spec gives a bit more leeway
here:

     "The device server may support other commands."

At least for READ_CAPACITY, Chris + Phillip reported that ESX expects
this to work in order to probe LUNs in Standby as well.

Oh, sure it does. And your patch just proves it.

No, what I'm worried about is _other_ ALUA implementations, which might take the spec a bit more literally.
And I dimly remember that we've had similar issues already.

It's simply bad design to rely on optional features; you should really only ever rely on mandatory ones, and treat even those cautiously. So relying on READ CAPACITY to even _establish_ a multipath device is prone to failure.

And the number of issues we have had with device resizing is a direct result of that. So personally I would love to see the READ CAPACITY check go away from multipathing. We need to evaluate that carefully, of course, but I think it's a worthwhile goal.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)