Re: ESXi + LIO + Ceph RBD problem

Martin Svec <martin.svec@xxxxxxxx> · Mon, 24 Aug 2015 12:49:09 +0200

Hello Alex,
> Thanks to Mike Christie's excellent analysis, a new issue has been
> identified that will prevent at least some of the ESXi/LIO/Ceph
> issues.  A number of these implementations use clustering, i.e.
> Pacemaker, same as what we do.  Upon failover, the logic is to start
> the target(s) then open these up to initiators then start the LUNs.
> However, apparently ESXi will scan the targets on failover, discover
> that they have no LUNs (in the brief period between target and LUN
> start) and will not rescan the target any more.
>
> So what has to happen is either not enable the target or block the
> ports on failover until all LUNs complete their startup.  We will
> implement this behavior shortly and advise on test results.

That's one of the reasons I rewrote the Resource Agent from scratch and now configure LIO iSCSI directly
through the configfs control plane. Our Resource Agent sets up a TPGT as a whole, i.e. it adds NPs, LUNs,
ACLs, CHAP, options like cmdsn_depth, etc., and only then enables it, so an initiator never sees the
target without its LUNs. We also have another RA sitting under the TPGT RAs that cleans up the entire
configfs plane on stop/restart. This RA guarantees a full reset of the target configuration when
something is really wrong and there are leftover misconfigured HBAs/TPGTs/LUNs on the cluster node.
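
The ordering described above (populate everything, enable last) could be sketched roughly as follows.
This is an illustration only, not the actual Resource Agent: the function name bring_up_tpg, its
parameters, and the name of the LUN symlink are hypothetical. The directory layout (np/, lun/, acls/,
enable under tpgt_N) follows the usual LIO iSCSI configfs tree as far as I know it, and CHAP, mapped
LUNs and error handling are left out.

```python
import os

# Usual location of the LIO iSCSI fabric in configfs (assumption; adjust if needed).
CONFIGFS_ISCSI = "/sys/kernel/config/target/iscsi"


def write_attr(path, value):
    """Write a single configfs attribute."""
    with open(path, "w") as f:
        f.write(str(value))


def bring_up_tpg(iqn, tpgt, portals, luns, acls, root=CONFIGFS_ISCSI):
    """Populate a TPG completely and enable it as the very last step.

    portals: list of "ip:port" strings
    luns:    dict lun_id -> path of an already configured backstore device
    acls:    dict initiator IQN -> cmdsn_depth
    Returns the ordered list of steps performed.
    """
    steps = []
    tpg = os.path.join(root, iqn, "tpgt_%d" % tpgt)
    os.makedirs(tpg, exist_ok=True)
    steps.append("tpg")

    # Network portals.
    for portal in portals:
        os.makedirs(os.path.join(tpg, "np", portal), exist_ok=True)
        steps.append("np " + portal)

    # LUNs: each one is a directory holding a symlink to its backstore device.
    for lun_id, backstore in sorted(luns.items()):
        lun_dir = os.path.join(tpg, "lun", "lun_%d" % lun_id)
        os.makedirs(lun_dir, exist_ok=True)
        link = os.path.join(lun_dir, "dev")  # hypothetical link name
        if not os.path.islink(link):
            os.symlink(backstore, link)
        steps.append("lun %d" % lun_id)

    # Initiator ACLs with their per-initiator options.
    for initiator, depth in sorted(acls.items()):
        acl_dir = os.path.join(tpg, "acls", initiator)
        os.makedirs(acl_dir, exist_ok=True)
        write_attr(os.path.join(acl_dir, "cmdsn_depth"), depth)
        steps.append("acl " + initiator)

    # Only now does the target become visible to initiators.
    write_attr(os.path.join(tpg, "enable"), 1)
    steps.append("enable")
    return steps
```

Because the enable write is the final step, an initiator that scans the target during failover should
find either no usable TPG at all or one with every LUN already in place.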

I ran a number of stress tests with this setup a few years ago and no longer observed any false ESXi
rescans.

Martin

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html