Re: ALUA for HA failover and multipathd

Chris Boot <crb@xxxxxxxxxxxxxxxxxxxxx> · Tue, 10 Jun 2014 10:11:57 +0100

On 06/06/14 07:34, Nicholas A. Bellinger wrote:
> Hi Chris & Phillip,
> 
> On Thu, 2014-06-05 at 15:31 +0100, Chris Boot wrote:
>> Hi folks,
>>
>> Background: I'm working on creating a Highly Available iSCSI system
>> using Pacemaker with some collaborators. We looked at the existing
>> iSCSILogicalUnit and iSCSITarget resource scripts, and they didn't seem
>> to do quite what we wanted so we have started down the route of writing
>> our own. Our new scripts are GPL and their current incarnations are
>> available at https://github.com/tigercomputing/ocf-lio
>>
>> In general terms the setup is reasonably simple: we have a DRBD volume
>> running in dual-primary mode, which is then used to create an iblock
>> device, which itself is exported over iSCSI. We have been attempting to
>> use ALUA multipathing in implicit mode only to manage target failover.
>>
>> We create two ALUA TPGs on each node, call them east/west, and mark one
>> as Active/Optimised and the other as Active/NonOptimised. When we create
>> the iSCSI TPGs on both nodes, one node's TPG is placed in the west ALUA
>> TPG and the other node's is placed into the east ALUA TPG.
>>
>> When simulating a fail-over, the ALUA states on both east and west are
>> changed on both nodes and kept synchronised.
>>
>> When using multipathd on Linux as the initiator, everything appears to
>> work well until we switch roles on the target. multipathd then seems to
>> stick to the old path, even though it is now NonOptimised and running
>> slowly due to the 100ms nonop_delay_msecs.
>>
>> If, instead, we set the standby path to Standby mode rather than
>> Active/NonOptimised, multipathd correctly notices the path is
>> unavailable and sends IO over the Active/Optimised path. However, if the
>> initiator originally logs in to the target while the path is in Standby
>> mode, it fails to correctly probe the device. When it becomes
>> Active/Optimised during failover, multipathd is unable to use it and
>> fails the path. The TUR checker returns that the path is active, though,
>> and makes the path active again, only to be failed again etc... The only
>> way to bring it back to life is to "echo 1 >
>> /sys/block/$DEV/device/rescan" and re-run multipath by hand.
>>
>> I haven't been able to test this myself, but Philip (CCed) reports that
>> similar behaviour is seen using VMware as the initiator rather than Linux.
>>
>> Has anyone managed to set up an ALUA multipath HA SAN with two nodes and
>> LIO? What are we missing? Am I going to have to throw in the towel on
>> ALUA and just use virtual IP failover instead?
> 
> After testing this evening with a similar config on a single target
> instance, the issue where initial LUN probe failures occur on an ALUA
> group set implicitly to Standby state is reproducible.
> 
> The failure occurs during the initial READ_CAPACITY, which is currently
> disallowed in opcode checking within core_alua_state_standby() code.  I
> thought at one point READ_CAPACITY could fail during initial LUN probe
> and still bring up a struct scsi_device with a zero number of sectors,
> but could be wrong..? (Hannes CC'ed)
> 
> In any event, the following patch to permit READ_CAPACITY addresses the
> initial LUN probe failure and works on my end, and should allow implicit
> ALUA state changes between Active/* and Standby, in either direction, to
> function now.
> 
> Please confirm with your setup.

Hi Nab,

Ack, this fixes the issue completely for me under Linux with multipathd.
The standby path is now probed correctly when you log in to the target,
and when you fail over to it everything carries on. Thanks very much!

Note that I tested on 3.14 so had to replace set_ascq() with *alua_ascq
as you did for the stable patches.
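
For anyone following along without a kernel tree to hand, here is a rough
user-space model of the standby opcode check after the patch. It is only a
sketch under assumptions: it covers just the opcodes visible in the quoted
hunk (the real function permits more), standby_check() is a made-up name,
the constants are defined locally with the values from the SCSI standards,
and the ASCQ is handed back through an out-parameter in the spirit of the
*alua_ascq style mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SCSI opcodes and service actions, defined locally for this sketch. */
#define RECEIVE_DIAGNOSTIC     0x1c
#define SEND_DIAGNOSTIC        0x1d
#define READ_CAPACITY          0x25  /* READ CAPACITY (10) */
#define SERVICE_ACTION_IN      0x9e  /* SERVICE ACTION IN (16) */
#define REPORT_LUNS            0xa0
#define MAINTENANCE_IN         0xa3
#define SAI_READ_CAPACITY_16   0x10
#define MI_REPORT_TARGET_PGS   0x0a

/* ASC 04h / ASCQ 0Bh: logical unit not accessible, target port in standby. */
#define ASCQ_04H_ALUA_TG_PT_STANDBY 0x0b

/*
 * Hypothetical user-space model: returns 0 if the CDB is allowed while the
 * target port group is in Standby, or 1 (and sets *alua_ascq) if it is not.
 * Only the opcodes shown in the quoted hunk are modelled here.
 */
static int standby_check(const uint8_t *cdb, uint8_t *alua_ascq)
{
	assert(cdb != NULL && alua_ascq != NULL);

	switch (cdb[0]) {
	case REPORT_LUNS:
	case RECEIVE_DIAGNOSTIC:
	case SEND_DIAGNOSTIC:
	case READ_CAPACITY:
		return 0;
	case SERVICE_ACTION_IN:
		/* Only the READ CAPACITY (16) service action is let through. */
		if ((cdb[1] & 0x1f) == SAI_READ_CAPACITY_16)
			return 0;
		break;
	case MAINTENANCE_IN:
		if ((cdb[1] & 0x1f) == MI_REPORT_TARGET_PGS)
			return 0;
		break;
	default:
		break;
	}

	/* Everything else is refused with the "standby" ASCQ. */
	*alua_ascq = ASCQ_04H_ALUA_TG_PT_STANDBY;
	return 1;
}
```

In this model a READ(10) (opcode 0x28) is refused with ASCQ 0x0b, while
READ CAPACITY (10) and READ CAPACITY (16) are let through, which is the
behaviour the patch is meant to produce during the initial LUN probe.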

Tested-by: Chris Boot <crb@xxxxxxxxxxxxxxxxxxxxx>

FWIW, the kernel messages we obtain when probing the disk look like:

[  388.929254] scsi12 : iSCSI Initiator over TCP/IP
[  389.184537] scsi 12:0:0:0: Direct-Access     LIO-ORG  test1   4.0  PQ: 0 ANSI: 5
[  389.184632] scsi 12:0:0:0: alua: supports implicit TPGS
[  389.185229] scsi 12:0:0:0: alua: port group 11 rel port 01
[  389.185390] scsi 12:0:0:0: alua: port group 11 state S non-preferred supports TOlUSNA
[  389.185393] scsi 12:0:0:0: alua: Attached
[  389.185791] sd 12:0:0:0: Attached scsi generic sg5 type 0
[  389.186499] sd 12:0:0:0: [sde] 2147418040 512-byte logical blocks: (1.09 TB/1023 GiB)
[  389.187254] sd 12:0:0:0: [sde] Write Protect is off
[  389.187258] sd 12:0:0:0: [sde] Mode Sense: 43 00 10 08
[  389.188301] sd 12:0:0:0: [sde] Write cache: enabled, read cache: enabled, supports DPO and FUA
[  389.190677] ldm_validate_partition_table(): Disk read failed.
[  389.190705] Dev sde: unable to read RDB block 0
[  389.190734]  sde: unable to read partition table
[  389.192784] sd 12:0:0:0: [sde] Attached SCSI disk
[  389.246318] sd 10:0:0:0: alua: port group 10 state A preferred supports TOlUSNA
[  389.325327] sd 10:0:0:0: alua: port group 10 state A preferred supports TOlUSNA
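
For reading the "alua:" lines above, a small helper along these lines can
turn the single-letter state into a name. This is a sketch, not anything
from the kernel: alua_state_name() and alua_state_from_line() are made-up
names, and the letter mapping is my understanding of what the scsi_dh_alua
handler prints (A = active/optimised, N = active/non-optimised,
S = standby, and so on). As far as I can tell, in the "supports" string an
upper-case letter means the state is supported and a lower-case one means
it is not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map the state letter from an "alua: port group ... state X" line to a name. */
static const char *alua_state_name(char c)
{
	switch (c) {
	case 'A': return "active/optimised";
	case 'N': return "active/non-optimised";
	case 'S': return "standby";
	case 'U': return "unavailable";
	case 'L': return "LBA-dependent";
	case 'O': return "offline";
	case 'T': return "transitioning";
	default:  return "unknown";
	}
}

/*
 * Pull the state letter out of a kernel log line. Returns 0 if the line has
 * no " state " token, or if nothing follows it.
 */
static char alua_state_from_line(const char *line)
{
	static const char key[] = " state ";
	const char *p;

	assert(line != NULL);
	p = strstr(line, key);
	if (p == NULL)
		return 0;
	/* sizeof(key) - 1 is the length of " state " without the NUL. */
	return p[sizeof(key) - 1];
}
```

Applied to the two port groups in the log, this gives "standby" for port
group 11 (the newly probed path) and "active/optimised" for port group 10.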

Thanks for getting a patch to us so quickly, and sorry it took so long
to get it tested.

> Thanks!
> 
> --nab
> 
> diff --git a/drivers/target/target_core_alua.c b/drivers/target/target_core_alua.c
> index fcbe612..63512cc 100644
> --- a/drivers/target/target_core_alua.c
> +++ b/drivers/target/target_core_alua.c
> @@ -576,7 +576,16 @@ static inline int core_alua_state_standby(
>  	case REPORT_LUNS:
>  	case RECEIVE_DIAGNOSTIC:
>  	case SEND_DIAGNOSTIC:
> +	case READ_CAPACITY:
>  		return 0;
> +	case SERVICE_ACTION_IN:
> +		switch (cdb[1] & 0x1f) {
> +		case SAI_READ_CAPACITY_16:
> +			return 0;
> +		default:
> +			set_ascq(cmd, ASCQ_04H_ALUA_TG_PT_STANDBY);
> +			return 1;
> +		}
>  	case MAINTENANCE_IN:
>  		switch (cdb[1] & 0x1f) {
>  		case MI_REPORT_TARGET_PGS:
> 
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Chris Boot
Tiger Computing Ltd
"Linux for Business"

Tel: 01600 483 484
Web: http://www.tiger-computing.co.uk
Follow us on Facebook: http://www.facebook.com/TigerComputing

Registered in England. Company number: 3389961
Registered address: Wyastone Business Park,
 Wyastone Leys, Monmouth, NP25 3SR