Hello Alex,

> Thanks to Mike Christie's excellent analysis, a new issue has been
> identified that will prevent at least some of the ESXi/LIO/Ceph
> issues. A number of these implementations use clustering, i.e.
> Pacemaker, same as what we do. Upon failover, the logic is to start
> the target(s) then open these up to initiators then start the LUNs.
> However, apparently ESXi will scan the targets on failover, discover
> that they have no LUNs (in the brief period between target and LUN
> start) and will not rescan the target any more.
>
> So what has to happen is either not enable the target or block the
> ports on failover until all LUNs complete their startup. We will
> implement this behavior shortly and advise on test results.

That's one of the reasons I rewrote the Resource Agent from scratch and
now configure LIO iSCSI directly through the configfs control plane. Our
Resource Agent sets up a TPGT as a whole, i.e. it adds NPs, LUNs, ACLs,
CHAP, options like cmdsn_depth, etc., and only then enables it, so an
initiator never sees an enabled target without its LUNs.

Also, we have another RA sitting under the TPGT RAs that performs
cleanup of the entire configfs plane on stop/restart. This RA guarantees
a full reset of the target configuration when something is really wrong
and there are leftover misconfigured HBAs/TPGTs/LUNs on the cluster node.

I ran a number of stress tests with this setup a few years ago and no
longer observed any false ESXi rescans.

Martin
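
The "populate everything, enable last" ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual Resource Agent: the function name build_tpgt, its parameters, and the symlink name "backstore" are hypothetical, CHAP setup is omitted, and the directory layout (iscsi/<iqn>/tpgt_<n>/{np,lun,acls,enable}) is assumed to match the LIO configfs tree under /sys/kernel/config/target. The root is a parameter so the ordering can be exercised against a scratch directory.

```python
import os
from pathlib import Path


def write_attr(path: Path, value: str) -> None:
    """Write a single value to a configfs-style attribute file."""
    path.write_text(value)


def build_tpgt(root, iqn, tag, portals, luns, acls):
    """Populate one TPGT completely, then enable it as the final step.

    root    -- assumed configfs target root (e.g. /sys/kernel/config/target)
    portals -- list of "ip:port" strings (network portals, NPs)
    luns    -- dict: LUN index -> path of an existing backstore object
    acls    -- dict: initiator IQN -> cmdsn_depth
    Returns the ordered list of steps performed (for inspection only).
    """
    steps = []
    tpg = Path(root) / "iscsi" / iqn / f"tpgt_{tag}"
    tpg.mkdir(parents=True, exist_ok=True)  # assumed to start disabled
    steps.append("tpgt")

    for portal in portals:
        (tpg / "np" / portal).mkdir(parents=True, exist_ok=True)
        steps.append(f"np:{portal}")

    for index, backstore in luns.items():
        lun_dir = tpg / "lun" / f"lun_{index}"
        lun_dir.mkdir(parents=True, exist_ok=True)
        link = lun_dir / "backstore"  # hypothetical link name
        if not link.is_symlink():
            os.symlink(backstore, link)  # export the backstore as this LUN
        steps.append(f"lun:{index}")

    for initiator, depth in acls.items():
        acl_dir = tpg / "acls" / initiator
        acl_dir.mkdir(parents=True, exist_ok=True)
        write_attr(acl_dir / "cmdsn_depth", str(depth))
        steps.append(f"acl:{initiator}")

    # Only now make the TPGT visible, so a scanning initiator
    # never finds an enabled target that has no LUNs yet.
    write_attr(tpg / "enable", "1")
    steps.append("enable")
    return steps
```

The point of the sketch is only the ordering: every NP, LUN and ACL exists before the single write to "enable", which is what closes the window in which ESXi could scan a LUN-less target.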