Re: ESXi + LIO + Ceph RBD problem

Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> · Sat, 22 Aug 2015 14:54:19 -0400

>> 2. In the block layer add callouts/cmds so that we can abort
>> requests/bios at the LLD level.
>>
>> 3. For rbd, we will implement support for #2. In ceph then we would need
>> to add code to be able to track down commands and kill them if we can or
>> at least figure out what is going on and log a message so we do not have
>> these mysterious hung commands.
>
> We just had a short network disruption, likely simply leaf/spine
> overload, which temporarily hung up RBD<->LIO traffic.  ESXi<->LIO
> traffic stayed up.  RBD seems to allow for long IO waits, i.e. you
> could wait 30+ seconds for RBD IO to complete, but ESXi goes into a
> death spiral after 5 seconds.  So if there were an option on either
> LIO or RBD side to just fail an IO that did not complete within say 4
> seconds, this would take care of the nasty consequences on ESXi side.
>
> Can RBD IO be aborted after a given number of seconds?
>
> ESXi will then retry the IO and if the problem was transient, that IO
> will complete and life goes on.

Thanks to Mike Christie's excellent analysis, a new issue has been
identified, and working around it should prevent at least some of the
ESXi/LIO/Ceph problems.  A number of these implementations use
clustering, e.g. Pacemaker, as we do.  Upon failover, the logic is to
start the target(s), open them up to initiators, and only then start
the LUNs.  However, ESXi apparently scans the targets on failover,
discovers that they have no LUNs (in the brief period between target
and LUN start), and does not rescan the target again.

So on failover we must either leave the target disabled or block the
ports until all LUNs have completed their startup.  We will implement
this behavior shortly and report the test results.
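
To make the intended ordering concrete, here is a rough sketch of the
failover sequence in Python.  It is only an illustration, not our
actual resource agent: failover_start and the four callables passed to
it are hypothetical names, and in a real cluster they would wrap
whatever the resource agent uses to block the iSCSI portal and start
the LUNs.

```python
import time
from typing import Callable, Iterable


def failover_start(
    block_portal: Callable[[], None],
    start_lun: Callable[[str], None],
    lun_ready: Callable[[str], bool],
    unblock_portal: Callable[[], None],
    luns: Iterable[str],
    timeout_s: float = 60.0,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Expose the target to initiators only after every LUN is up.

    All callables are hypothetical hooks supplied by the caller.
    """
    luns = list(luns)

    # 1. Keep initiators away while the target has no LUNs.
    block_portal()

    # 2. Start every LUN behind the blocked portal.
    for lun in luns:
        start_lun(lun)

    # 3. Poll until all LUNs report ready, or give up.
    deadline = clock() + timeout_s
    pending = set(luns)
    while pending:
        pending = {lun for lun in pending if not lun_ready(lun)}
        if not pending:
            break
        if clock() >= deadline:
            # The portal stays blocked on purpose: a LUN-less target
            # is never shown to ESXi.
            raise TimeoutError(f"LUNs not ready: {sorted(pending)}")
        sleep(poll_s)

    # 4. Only now let ESXi scan the target.
    unblock_portal()
```

Note that on timeout the portal is left blocked, so the node fails
closed rather than presenting an empty target.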

Another test I am planning to perform in the lab is to disconnect the
Ceph public network from an LIO node, but leave the iSCSI network
connected to ESXi.  This should cause timeouts, then a failover to
another node and a rescan.  Ideally, the RBD device will abort IOs in
progress so that ESXi knows they are not going to complete and does
not wait for them.
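
For illustration only, the "fail the IO after about 4 seconds" idea
looks roughly like this in userspace Python.  This is not LIO or krbd
code, and submit_with_deadline is a made-up name; the 4 second
deadline is simply chosen to sit under the roughly 5 seconds ESXi
tolerates, as discussed above.

```python
import concurrent.futures as cf
import time

IO_DEADLINE_S = 4.0  # assumption: must stay below ESXi's ~5 s limit

# Small worker pool standing in for the backend IO path.
_pool = cf.ThreadPoolExecutor(max_workers=4)


def submit_with_deadline(io_fn, deadline_s=IO_DEADLINE_S):
    """Run io_fn and report a timeout if it exceeds deadline_s.

    Returns ("ok", result) or ("timeout", None).
    """
    future = _pool.submit(io_fn)
    try:
        return ("ok", future.result(timeout=deadline_s))
    except cf.TimeoutError:
        # cancel() only helps if the work has not started yet; an IO
        # already in flight keeps running in the background.
        future.cancel()
        return ("timeout", None)


if __name__ == "__main__":
    def slow_io():
        time.sleep(0.5)  # stands in for a stalled RBD request
        return b"data"

    print(submit_with_deadline(slow_io, deadline_s=0.1))
    print(submit_with_deadline(slow_io, deadline_s=2.0))
```

The sketch also shows the limitation: the caller is told about the
timeout, but the underlying request is not actually killed, which is
why real abort support in the block layer and rbd is needed.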

Regards,
Alex