Re: Problems with multipathing between VMWare ESXi and LIO

Hello Eljas,
Hello Nicholas,

* Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> [2013-05-24 22:00]:
> Btw, if your going to run multiple network portals on the same subnet,
> you'll need separate VLANs to isolate traffic amongst the ports,
> otherwise you'll end up seeing strange scenarios where outgoing
> traffic is getting sent on the wrong Ethernet port.

The following config under Debian wheezy and VMware ESX allows you to
have two network cards in the same VLAN/subnet and also avoids
asymmetric routing, by using policy-based routing and arp_filter on
Linux and VMkernel port bindings on ESX. I use it in production.

(wheezy) [~] cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
        address 10.105.0.8
        netmask 255.255.0.0
        gateway 10.105.0.1
        post-up /sbin/ip rule add from 10.105.0.8 lookup 105 || /bin/true
        post-up /sbin/ip route add table 105 10.105.0.0/16 src 10.105.0.8 dev eth0 || /bin/true
        post-up /sbin/ip route add table 105 default src 10.105.0.8 via 10.105.0.1 dev eth0 || /bin/true
        post-up /bin/echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter || /bin/true

auto eth1
iface eth1 inet static
        address 10.105.0.9
        netmask 255.255.0.0
        post-up /sbin/ip rule add from 10.105.0.9 lookup 205 || /bin/true
        post-up /sbin/ip route add table 205 10.105.0.0/16 src 10.105.0.9 dev eth1 || /bin/true
        post-up /sbin/ip route add table 205 default src 10.105.0.9 via 10.105.0.1 dev eth1 || /bin/true
        post-up /bin/echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter || /bin/true
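Note that the two identical arp_filter post-up lines both set the same
global sysctl; to make the setting survive independently of ifupdown,
it can also go into /etc/sysctl.conf (a sketch, assuming an otherwise
stock wheezy sysctl setup):

```
# /etc/sysctl.conf
# Answer ARP requests only on the interface that actually owns the
# queried address, so eth0 and eth1 do not reply for each other.
net.ipv4.conf.all.arp_filter = 1
```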
(wheezy) [~] ip route show
default via 10.105.0.1 dev eth0
10.105.0.0/16 dev eth0  proto kernel  scope link  src 10.105.0.8
10.105.0.0/16 dev eth1  proto kernel  scope link  src 10.105.0.9
(wheezy) [~] ip route show table 105
default via 10.105.0.1 dev eth0  src 10.105.0.8
10.105.0.0/16 dev eth0  scope link  src 10.105.0.8
(wheezy) [~] ip route show table 205
default via 10.105.0.1 dev eth1  src 10.105.0.9
10.105.0.0/16 dev eth1  scope link  src 10.105.0.9
(wheezy) [~] arp -an
? (10.105.200.243) at 00:50:56:61:c3:df [ether] on eth0
? (10.105.0.1) at 9c:4e:20:d9:8a:47 [ether] on eth1
? (10.105.200.244) at 00:50:56:68:45:80 [ether] on eth1
? (10.105.0.1) at 9c:4e:20:d9:8a:47 [ether] on eth0
? (10.105.200.244) at 00:50:56:68:45:80 [ether] on eth0
? (10.105.200.243) at 00:50:56:61:c3:df [ether] on eth1

# On the router:
merlin#show arp 10.105.0.8
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.105.0.8             17   0025.90a5.61f0  ARPA   Vlan105
merlin#show arp 10.105.0.9
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.105.0.9              0   0025.90a5.61f1  ARPA   Vlan105

Explanation: The default route in the global routing table is necessary
to initiate outbound connections to systems outside the subnet. The
distinction between the subnet route and the default route in the
interface-specific routing tables is only necessary so that the PBR also
works for connections which are initiated from outside the subnet. The
following line _per_ interface would otherwise suffice:

        post-up /sbin/ip route add table 105 default src 10.105.0.8 dev eth0 || /bin/true

However, connections initiated from a different subnet to the interface
which does not carry the global default gateway would then be routed
asymmetrically (in the example, traffic inbound to 10.105.0.9 would flow
out through eth0, but only if initiated from a different subnet).

I verified the above setup with tcpdump by sniffing on each interface
with a filter set to the IP address that should _not_ go through that
interface, and then initiating traffic from within and from outside the
subnet.
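For the record, that check can be sketched like this (run as root; each
capture filters on the address that must _not_ appear on that interface,
so any captured packet at all indicates a PBR/arp_filter problem):

```
tcpdump -ni eth0 host 10.105.0.9   # must stay silent
tcpdump -ni eth1 host 10.105.0.8   # must stay silent
```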

You can also see in the ARP table that there are entries for each
VMkernel port (10.105.0.1 (mgmt), 10.105.200.243 (iscsib) and
10.105.200.244 (iscsia)) on each link. By the way, this setup gives you
_twice_ the throughput to the storage because it uses two VMkernel ports
on the ESX server.

About the ESX server configuration: you need at least two VMkernel
ports. Each VMkernel port may have only one active uplink; all other
NICs have to be moved to unused or be non-existent. Then, in the iSCSI
software initiator, you have to add them under port bindings. ESX port
binding only works if the VMkernel ports are in the same subnet as the
target. Trust me, I tried everything here, including proxy ARP; it does
not work otherwise. But when you do _not_ use port binding, the target
may also be in a different subnet. If you google for it you will get the
same statement from VMware.[1]
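For reference, the port binding can also be done from the ESXi shell
with esxcli (a sketch; the vmk numbers and the vmhba name of the
software iSCSI adapter are assumptions for this example and will differ
per host):

```
# Bind both iSCSI VMkernel ports to the software iSCSI adapter
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk2
```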

See screenshots for configuration:
http://thomas.glanzmann.de/iscsi/

What can you see here?

01.png: Two extra VMkernel ports, each having only one uplink. You could also
        use one vSwitch and then override the failover order.
02.png: 2 Devices, 8 Targets, 8 Paths. Why? Two devices because we export two
        backstores; 8 targets and 8 paths because I created one target for
        each device for optimal throughput. IIRC the same target is bound to
        one CPU, so using one target per device scales better. Why do we see
        8 paths?

        2 Devices, 4 Paths each:
        10.105.200.243 => 10.105.0.8
        10.105.200.243 => 10.105.0.9
        10.105.200.244 => 10.105.0.8
        10.105.200.244 => 10.105.0.9

See also 04.png, 05.png, 06.png and 07.png

03.png: iSCSI Portbinding configured in the iSCSI Initiator.
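The path count above is just the product of the fan-out at each layer; a
quick sketch of the arithmetic (addresses taken from the setup in this
mail):

```python
# Each ESX VMkernel port logs in to every LIO network portal of every
# target, so paths = targets * vmkernel_ports * portals_per_target.
vmkernel_ports = ["10.105.200.243", "10.105.200.244"]  # iscsia / iscsib
target_portals = ["10.105.0.8", "10.105.0.9"]          # LIO portals
targets = 2                                            # one per device

paths_per_target = [(vmk, p) for vmk in vmkernel_ports
                             for p in target_portals]
print(len(paths_per_target))            # 4 paths per device
print(targets * len(paths_per_target))  # 8 paths total
```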

[1] http://blogs.vmware.com/vsphere/2011/08/vsphere-50-storage-features-part-12-iscsi-multipathing-enhancements.html

> > When VMWare tries to connect to target, these show up in the target's
> > dmesg (seemingly randomly, sometimes it's just one type, sometimes
> > it's both. And Min/Max keep changing):

> Mmm, this would seem to indicate that something is afoul below the
> application layer..

I think you misconfigured the target. The right configuration is:

targetcli <<EOF
set global auto_cd_after_create=false
/backstores/fileio create shared-01.v105.gmvl.de /var/tmp/shared-01.v105.gmvl.de size=80G buffered=true
/backstores/fileio create shared-02.v105.gmvl.de /var/tmp/shared-02.v105.gmvl.de size=80G buffered=true

/iscsi create iqn.2013-03.de.gmvl.v105.storage:shared-01.v105.gmvl.de
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-01.v105.gmvl.de/tpgt1/portals create 10.105.0.8
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-01.v105.gmvl.de/tpgt1/portals create 10.105.0.9
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-01.v105.gmvl.de/tpgt1/luns create /backstores/fileio/shared-01.v105.gmvl.de lun=10
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-01.v105.gmvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1

/iscsi create iqn.2013-03.de.gmvl.v105.storage:shared-02.v105.gmvl.de
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-02.v105.gmvl.de/tpgt1/portals create 10.105.0.8
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-02.v105.gmvl.de/tpgt1/portals create 10.105.0.9
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-02.v105.gmvl.de/tpgt1/luns create /backstores/fileio/shared-02.v105.gmvl.de lun=20
/iscsi/iqn.2013-03.de.gmvl.v105.storage:shared-02.v105.gmvl.de/tpgt1/ set attribute authentication=0 demo_mode_write_protect=0 generate_node_acls=1 cache_dynamic_acls=1
saveconfig
yes
EOF

@Nicholas: Is there any chance that we can get rid of the questions
targetcli asks when stdin is _not_ a terminal? It really annoyed me. In
the beginning I was piping in the config; after stdin had closed,
targetcli asked in an endless loop whether I wanted to save the config,
so I hit ctrl-z, killed the process, went back in interactively and
saved the config. :-) Probably you will tell me to learn Python and use
the API? :-) Maybe I should submit a patch.

Let me know if it works for you. In order to write this e-mail I
verified the setup by building it from scratch:

        - I installed one physical server with Debian wheezy (why on
          earth does targetcli have 1 GB of dependencies? I think I'll
          file a bug report; I mean, it is one Python script, or am I
          mistaken?)

        - I installed an ESX server and did the setup.

Before this e-mail I never cared about traffic originating from a
different subnet, so I took the time to build a PBR ruleset and verified
that it works both from within and from outside the same subnet.

Cheers,
        Thomas
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



