ALUA for HA failover and multipathd

Chris Boot <crb@xxxxxxxxxxxxxxxxxxxxx> · Thu, 05 Jun 2014 15:31:55 +0100

Hi folks,

Background: I'm working on creating a Highly Available iSCSI system
using Pacemaker with some collaborators. We looked at the existing
iSCSILogicalUnit and iSCSITarget resource scripts, and they didn't seem
to do quite what we wanted so we have started down the route of writing
our own. Our new scripts are GPL and their current incarnations are
available at https://github.com/tigercomputing/ocf-lio

In general terms the setup is reasonably simple: we have a DRBD volume
running in dual-primary mode, which is then used to create an iblock
device, which itself is exported over iSCSI. We have been attempting to
use ALUA multipathing in implicit mode only to manage target failover.

We create two ALUA TPGs on each node, call them east/west, and mark one
as Active/Optimised and the other as Active/NonOptimised. When we create
the iSCSI TPGs on both nodes, one node's TPG is placed in the west ALUA
TPG and the other node's is placed into the east ALUA TPG.

When simulating a fail-over, the ALUA states on both east and west are
changed on both nodes and kept synchronised.

With multipathd on Linux as the initiator, everything appears to work
well until we switch roles on the target. multipathd then seems to
stick to the old path, even though it is now NonOptimised and running
slowly due to the 100ms nonop_delay_msecs.

If, instead, we set the standby path to Standby mode rather than
Active/NonOptimised, multipathd correctly notices the path is
unavailable and sends IO over the Active/Optimised path. However, if the
initiator originally logs in to the target while the path is in Standby
mode, it fails to probe the device correctly. When the path becomes
Active/Optimised during failover, multipathd is unable to use it and
fails the path. The TUR checker reports the path as active, though, and
reinstates it, only for it to be failed again, and so on. The only way
to bring it back to life is to run
"echo 1 > /sys/block/$DEV/device/rescan" and re-run multipath by hand.
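
A rough sketch of that manual workaround as a shell function, in case it
helps anyone reproduce it. The names SYSFS_BLOCK and MULTIPATH_CMD are
hypothetical overrides added for dry runs and are not part of any tool;
the function only wraps the rescan write and a plain multipath
invocation.

```shell
#!/bin/sh
# Sketch of the manual workaround: rescan each named SCSI path device,
# then re-run multipath. SYSFS_BLOCK and MULTIPATH_CMD are hypothetical
# overrides (assumptions for illustration), handy for dry runs.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"
MULTIPATH_CMD="${MULTIPATH_CMD:-multipath}"

rescan_paths() {
    rc=0
    for dev in "$@"; do
        node="$SYSFS_BLOCK/$dev/device/rescan"
        if [ -w "$node" ]; then
            # Same write as the manual "echo 1 > .../rescan" step.
            echo 1 > "$node"
        else
            echo "skipping $dev: $node not writable" >&2
            rc=1
        fi
    done
    # Re-run multipath once, after all paths have been rescanned.
    $MULTIPATH_CMD || rc=1
    return $rc
}
```

For the map shown further down this would be called as
"rescan_paths sde sdc", as root on the initiator.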

I haven't been able to test this myself, but Philip (CCed) reports that
similar behaviour is seen using VMware as the initiator rather than Linux.

Has anyone managed to set up an ALUA multipath HA SAN with two nodes and
LIO? What are we missing? Am I going to have to throw in the towel on
ALUA and just use virtual IP failover instead?

We'd really appreciate some input on this.

To set up the target on *both* nodes:

tcm_node --establishdev iblock_0/drbd1 /dev/drbd1
tcm_node --setunitserialwithmd iblock_0/drbd1 \
    f88e7c31-77cb-46fd-90bb-dfc8a701406e
tcm_node --addaluatpgwithmd iblock_0/drbd1 lio_alua_west 100
tcm_node --addaluatpgwithmd iblock_0/drbd1 lio_alua_east 101
tcm_node --setaluatype=iblock_0/drbd1 lio_alua_west implict
tcm_node --setaluatype=iblock_0/drbd1 lio_alua_east implict
tcm_node --clearaluapref=iblock_0/drbd1 lio_alua_west
tcm_node --clearaluapref=iblock_0/drbd1 lio_alua_east

echo 100 > \
    /sys/kernel/config/target/core/iblock_0/drbd1/alua/lio_alua_west/nonop_delay_msecs
echo 0 > \
    /sys/kernel/config/target/core/iblock_0/drbd1/alua/lio_alua_west/trans_delay_msecs
echo 100 > \
    /sys/kernel/config/target/core/iblock_0/drbd1/alua/lio_alua_east/nonop_delay_msecs
echo 0 > \
    /sys/kernel/config/target/core/iblock_0/drbd1/alua/lio_alua_east/trans_delay_msecs

tcm_node --setaluastate=iblock_0/drbd1 lio_alua_west a
tcm_node --setaluastate=iblock_0/drbd1 lio_alua_east o
tcm_node --setaluapref=iblock_0/drbd1 lio_alua_east

lio_node --addnp iqn.2014-04.com.example:drbd1 1 0.0.0.0:3260
lio_node --addnodeacl iqn.2014-04.com.example:drbd1 1 \
    iqn.1993-08.org.debian:01:feb992244813
lio_node --addlun iqn.2014-04.com.example:drbd1 1 0 lun0 iblock_0/drbd1
lio_node --addlunacl=iqn.2014-04.com.example:drbd1 1 \
    iqn.2014-04.com.example 0 0
lio_node --enabletpg=iqn.2014-04.com.example:drbd1 1

On the west node only:
lio_node --setaluatpg=iqn.2014-04.com.example:drbd1 1 0 lio_alua_west

On the east node only:
lio_node --setaluatpg=iqn.2014-04.com.example:drbd1 1 0 lio_alua_east
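
To check what each node currently advertises, the alua_access_state
attribute in the same configfs directories can be read back. The helper
below is only a sketch: the function names and the ALUA_DIR override are
made up for illustration, and the numeric codes are the standard SPC
ALUA state values.

```shell
#!/bin/sh
# Hypothetical helper: print the ALUA access state of each target port
# group under one backstore. ALUA_DIR defaults to the configfs path used
# in this mail and can be overridden for other backstores.
ALUA_DIR="${ALUA_DIR:-/sys/kernel/config/target/core/iblock_0/drbd1/alua}"

# Map the numeric SPC ALUA state code to a readable name.
alua_state_name() {
    case "$1" in
        0)  echo "Active/Optimised" ;;
        1)  echo "Active/NonOptimised" ;;
        2)  echo "Standby" ;;
        3)  echo "Unavailable" ;;
        14) echo "Offline" ;;
        15) echo "Transitioning" ;;
        *)  echo "unknown($1)" ;;
    esac
}

# Walk every group directory and print "name: state" for each one that
# exposes a readable alua_access_state attribute.
show_alua_states() {
    for tpg in "$ALUA_DIR"/*/; do
        [ -r "${tpg}alua_access_state" ] || continue
        name=$(basename "$tpg")
        code=$(cat "${tpg}alua_access_state")
        printf '%s: %s\n' "$name" "$(alua_state_name "$code")"
    done
}
```

Running show_alua_states on both nodes and comparing the output is one
way to confirm the two sides really are in sync after a switch.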

To change from east to west, on *both* nodes:

tcm_node --clearaluapref=iblock_0/drbd1 lio_alua_east
tcm_node --setaluastate=iblock_0/drbd1 lio_alua_east a
tcm_node --setaluastate=iblock_0/drbd1 lio_alua_west o
tcm_node --setaluapref=iblock_0/drbd1 lio_alua_west
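
The four commands above generalise to either direction. A minimal
sketch, assuming the group names lio_alua_east and lio_alua_west used in
this mail; alua_switch_cmds and BACKSTORE are hypothetical names, and
the function only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical wrapper: print the tcm_node commands that make one ALUA
# group Active/Optimised and the other Active/NonOptimised. Pipe the
# output to sh to apply it.
BACKSTORE="${BACKSTORE:-iblock_0/drbd1}"

alua_switch_cmds() {
    new="$1"
    case "$new" in
        east) old=west ;;
        west) old=east ;;
        *) echo "usage: alua_switch_cmds east|west" >&2; return 2 ;;
    esac
    # Demote the old group first, then promote the new one, matching
    # the order of the manual commands.
    echo "tcm_node --clearaluapref=$BACKSTORE lio_alua_$old"
    echo "tcm_node --setaluastate=$BACKSTORE lio_alua_$old a"
    echo "tcm_node --setaluastate=$BACKSTORE lio_alua_$new o"
    echo "tcm_node --setaluapref=$BACKSTORE lio_alua_$new"
}
```

Running "alua_switch_cmds west | sh" on *both* nodes would then be
equivalent to the east-to-west switch listed above.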

Our multipath.conf looks like:

devices {
    device {
        vendor "LIO-ORG"
        path_grouping_policy group_by_prio
        path_checker tur
        prio alua
        hardware_handler "1 alua"
        failback immediate
        rr_weight uniform
        no_path_retry 12
        rr_min_io 100
    }
}

A sample 'multipath -l' output looks like:

test1 (36001405571059945e344331baecb97b1) dm-3 LIO-ORG,test1
size=1024G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| `- 63:0:0:0 sde 8:64 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
  `- 61:0:0:0 sdc 8:32 active undef running

And 'multipath -ll':

test1 (36001405571059945e344331baecb97b1) dm-3 LIO-ORG,test1
size=1024G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 63:0:0:0 sde 8:64 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  `- 61:0:0:0 sdc 8:32 active ready running
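
To compare path priorities before and after a role switch, the
'multipath -ll' output can be reduced to one line per path. This is only
a sketch that assumes the output layout shown above; path_prios is a
hypothetical name, not part of multipath-tools.

```shell
#!/bin/sh
# Hypothetical filter: read "multipath -ll" output on stdin and print
# "<device> prio=<N> group=<status>" for every path.
path_prios() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            # Path group header lines carry prio= and status= fields.
            if ($i ~ /^prio=/)   prio = substr($i, 6)
            if ($i ~ /^status=/) status = substr($i, 8)
            # A path line carries an H:C:T:L tuple followed by the device name.
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/ && i < NF)
                print $(i + 1), "prio=" prio, "group=" status
        }
    }'
}
```

Used as "multipath -ll test1 | path_prios". If the alua prioritizer has
picked up a role switch, the prio values for sde and sdc would be
expected to swap between two runs.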

Thanks,
Chris

-- 
Chris Boot
Tiger Computing Ltd
"Linux for Business"

Tel: 01600 483 484
Web: http://www.tiger-computing.co.uk
Follow us on Facebook: http://www.facebook.com/TigerComputing

Registered in England. Company number: 3389961
Registered address: Wyastone Business Park,
 Wyastone Leys, Monmouth, NP25 3SR