Re: ALUA for HA failover and multipathd

On Tue, 2014-06-10 at 14:56 +0200, Hannes Reinecke wrote:
> On 06/10/2014 11:11 AM, Chris Boot wrote:
> > On 06/06/14 07:34, Nicholas A. Bellinger wrote:
> >> Hi Chris & Phillip,
> >>
> >> On Thu, 2014-06-05 at 15:31 +0100, Chris Boot wrote:
> >>> Hi folks,
> >>>
> >>> Background: I'm working on creating a Highly Available iSCSI system
> >>> using Pacemaker with some collaborators. We looked at the existing
> >>> iSCSILogicalUnit and iSCSITarget resource scripts, and they didn't seem
> >>> to do quite what we wanted so we have started down the route of writing
> >>> our own. Our new scripts are GPL and their current incarnations are
> >>> available at https://github.com/tigercomputing/ocf-lio
> >>>
> >>> In general terms the setup is reasonably simple: we have a DRBD volume
> >>> running in dual-primary mode, which is then used to create an iblock
> >>> device, which itself is exported over iSCSI. We have been attempting to
> >>> use ALUA multipathing in implicit-only mode to manage target failover.
> >>>
> >>> We create two ALUA TPGs on each node, call them east/west, and mark one
> >>> as Active/Optimised and the other as Active/NonOptimised. When we create
> >>> the iSCSI TPGs on both nodes, one node's TPG is placed in the west ALUA
> >>> TPG and the other node's is placed into the east ALUA TPG.
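> >>>
> >>> As a minimal configfs sketch of that layout (the backstore name "disk1"
> >>> and the paths are illustrative; the real logic lives in the scripts
> >>> linked above):
> >>>
> >>>   # two ALUA target port groups on the shared iblock backstore
> >>>   cd /sys/kernel/config/target/core/iblock_0/disk1/alua
> >>>   mkdir east west
> >>>   echo 1 > east/tg_pt_gp_id
> >>>   echo 2 > west/tg_pt_gp_id
> >>>   echo 1 > east/alua_access_type   # implicit ALUA only
> >>>   echo 1 > west/alua_access_type
> >>>
> >>>   # bind this node's iSCSI TPG LUN to its ALUA group ($IQN stands in
> >>>   # for the target's IQN; the other node writes "west" here instead)
> >>>   echo east > /sys/kernel/config/target/iscsi/$IQN/tpgt_1/lun/lun_0/alua_tg_pt_gp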
> >>>
> >>> When simulating a fail-over, the ALUA states on both east and west are
> >>> changed on both nodes and kept synchronised.
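> >>>
> >>> (In configfs terms the failover is just flipping alua_access_state on
> >>> both groups on each node; assuming the layout sketched above, and the
> >>> SPC-4 numeric states 0 = Active/Optimised, 1 = Active/NonOptimised:
> >>>
> >>>   # promote west, demote east -- run on both nodes
> >>>   echo 0 > /sys/kernel/config/target/core/iblock_0/disk1/alua/west/alua_access_state
> >>>   echo 1 > /sys/kernel/config/target/core/iblock_0/disk1/alua/east/alua_access_state
> >>> )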
> >>>
> >>> When using multipathd on Linux as the initiator, everything appears to
> >>> work well until we switch roles on the target. multipathd seems to
> >>> stick to the old path, even though it is now NonOptimised and running
> >>> slowly due to the 100ms nonop_delay_msecs.
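> >>>
> >>> (For reference, nonop_delay_msecs is a per-group configfs attribute;
> >>> with the illustrative layout above it is set via:
> >>>
> >>>   echo 100 > /sys/kernel/config/target/core/iblock_0/disk1/alua/east/nonop_delay_msecs
> >>> )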
> >>>
> >>> If, instead, we set the standby path to Standby mode rather than
> >>> Active/NonOptimised, multipathd correctly notices the path is
> >>> unavailable and sends IO over the Active/Optimised path. However, if the
> >>> initiator originally logs in to the target while the path is in Standby
> >>> mode, it fails to correctly probe the device. When it becomes
> >>> Active/Optimised during failover, multipathd is unable to use it and
> >>> fails the path. The TUR checker returns that the path is active, though,
> >>> and makes the path active again, only to be failed again, etc. The only
> >>> way to bring it back to life is to "echo 1 >
> >>> /sys/block/$DEV/device/rescan" and re-run multipath by hand.
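> >>>
> >>> (The by-hand recovery, concretely -- $DEV being whichever sdX node the
> >>> failed path maps to:
> >>>
> >>>   echo 1 > /sys/block/$DEV/device/rescan
> >>>   multipath -r    # force multipathd to reload and re-evaluate the maps
> >>> )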
> >>>
> >>> I haven't been able to test this myself, but Phillip (CCed) reports that
> >>> similar behaviour is seen using VMware as the initiator rather than Linux.
> >>>
> >>> Has anyone managed to set up an ALUA multipath HA SAN with two nodes and
> >>> LIO? What are we missing? Am I going to have to throw in the towel on
> >>> ALUA and just use virtual IP failover instead?
> >>
> >> After testing this evening with a similar config on a single target
> >> instance, the issue where initial LUN probe failures occur on an ALUA
> >> group set implicitly to Standby state is reproducible.
> >>
> >> The failure occurs during the initial READ_CAPACITY, which is currently
> >> disallowed by the opcode checking within the core_alua_state_standby()
> >> code.  I thought at one point READ_CAPACITY could fail during the
> >> initial LUN probe and still bring up a struct scsi_device with a zero
> >> number of sectors, but I could be wrong? (Hannes CC'ed)
> >>
> >> In any event, the following patch to permit READ_CAPACITY addresses the
> >> initial LUN probe failure, works on my end, and should allow implicit
> >> ALUA Active/* <-> Standby state changes to function now.
> >>
> >> Please confirm with your setup.
> >
> > Hi Nab,
> >
> > Ack, this fixes the issue completely for me under Linux with multipathd.
> > The standby path is correctly probed now when you log in to the target,
> > and when you fail over to it everything carries on. Thanks very much!
> >
> > Note that I tested on 3.14, so I had to replace set_ascq() with
> > *alua_ascq as you did for the stable patches.
> >
> > Tested-by: Chris Boot <crb@xxxxxxxxxxxxxxxxxxxxx>
> >
> > FWIW, the kernel messages we obtain when probing the disk look like:
> >
> > [  388.929254] scsi12 : iSCSI Initiator over TCP/IP
> > [  389.184537] scsi 12:0:0:0: Direct-Access     LIO-ORG  test1    4.0  PQ: 0 ANSI: 5
> > [  389.184632] scsi 12:0:0:0: alua: supports implicit TPGS
> > [  389.185229] scsi 12:0:0:0: alua: port group 11 rel port 01
> > [  389.185390] scsi 12:0:0:0: alua: port group 11 state S non-preferred supports TOlUSNA
> > [  389.185393] scsi 12:0:0:0: alua: Attached
> > [  389.185791] sd 12:0:0:0: Attached scsi generic sg5 type 0
> > [  389.186499] sd 12:0:0:0: [sde] 2147418040 512-byte logical blocks: (1.09 TB/1023 GiB)
> > [  389.187254] sd 12:0:0:0: [sde] Write Protect is off
> > [  389.187258] sd 12:0:0:0: [sde] Mode Sense: 43 00 10 08
> > [  389.188301] sd 12:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
> > [  389.190677] ldm_validate_partition_table(): Disk read failed.
> > [  389.190705] Dev sde: unable to read RDB block 0
> > [  389.190734]  sde: unable to read partition table
> > [  389.192784] sd 12:0:0:0: [sde] Attached SCSI disk
> > [  389.246318] sd 10:0:0:0: alua: port group 10 state A preferred supports TOlUSNA
> > [  389.325327] sd 10:0:0:0: alua: port group 10 state A preferred supports TOlUSNA
> >
> > Thanks for getting a patch to us so quickly, and sorry it took so long
> > to get it tested.
> >
> >> Thanks!
> >>
> >> --nab
> >>
> >> diff --git a/drivers/target/target_core_alua.c b/drivers/target/target_core_alua.c
> >> index fcbe612..63512cc 100644
> >> --- a/drivers/target/target_core_alua.c
> >> +++ b/drivers/target/target_core_alua.c
> >> @@ -576,7 +576,16 @@ static inline int core_alua_state_standby(
> >>   	case REPORT_LUNS:
> >>   	case RECEIVE_DIAGNOSTIC:
> >>   	case SEND_DIAGNOSTIC:
> >> +	case READ_CAPACITY:
> >>   		return 0;
> >> +	case SERVICE_ACTION_IN:
> >> +		switch (cdb[1] & 0x1f) {
> >> +		case SAI_READ_CAPACITY_16:
> >> +			return 0;
> >> +		default:
> >> +			set_ascq(cmd, ASCQ_04H_ALUA_TG_PT_STANDBY);
> >> +			return 1;
> >> +		}
> >>   	case MAINTENANCE_IN:
> >>   		switch (cdb[1] & 0x1f) {
> >>   		case MI_REPORT_TARGET_PGS:
> >>
> >
> >
> Hmm. While I agree with the patch (and confirm that it's required to get 
> multipath working on LIO-target), it really seems that multipath is 
> making incorrect assumptions here.
> Looking at the spec, READ_CAPACITY is indeed not required to be supported
> on Standby paths, so multipath will fail against any ALUA implementation
> that follows the spec more closely.
> 
> Guess we need to discuss this on dm-devel ...
> 

<nod>, at least for Standby state, the spec gives a bit more leeway
here:

    "The device server may support other commands."

At least for READ_CAPACITY, Chris and Phillip reported that ESX expects
this to work in order to probe LUNs in Standby state as well.
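
A quick way to check this from a Linux initiator is sg3_utils against
whichever device node sits on the Standby path (/dev/sde below is just
an example):

    sg_rtpg -vv /dev/sde      # REPORT TARGET PORT GROUPS / ALUA states
    sg_readcap --16 /dev/sde  # READ CAPACITY(16); rejected in Standby
                              # without the patch above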

--nab
