On Wed, Aug 19, 2015 at 2:12 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
> On 08/19/2015 11:16 AM, Alex Gorbachev wrote:
>> What is the difference, and is there willingness to allow LIO to be
>> modified to work with this ESXi behavior? Or should we ask Vmware to
>> do something for ESXi to play better with LIO? I cannot fix the code,
>> but would be happy to be the voice of the issue via any available
>> channels.
>
> I think we want to:
>
> 1. Allow lio to do more than wait for a command during aborts. For lio
> we will want to add callouts similar to how we can override
> discard/unmap behavior.
>
> 2. In the block layer add callouts/cmds so that we can abort
> requests/bios at the LLD level.
>
> 3. For rbd, we will implement support for #2. In ceph then we would need
> to add code to be able to track down commands and kill them if we can or
> at least figure out what is going on and log a message so we do not have
> these mysterious hung commands.

We just had a short network disruption, likely simply leaf/spine
overload, which temporarily hung up RBD<->LIO traffic. ESXi<->LIO
traffic stayed up.

RBD seems to allow for long IO waits, i.e. you could wait 30+ seconds
for RBD IO to complete, but ESXi goes into a death spiral after 5
seconds. So if there were an option on either LIO or RBD side to just
fail an IO that did not complete within say 4 seconds, this would take
care of the nasty consequences on ESXi side.

Can RBD IO be aborted after a given number of seconds? ESXi will then
retry the IO and if the problem was transient, that IO will complete
and life goes on.

Thanks guys, this would make a huge difference for production critical
operations.

Alex

>
> I have been meaning to get to this, but as you have seen on the list I
> have taken a couple wrong turns on the cluster support and am still
> working on that.
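
For illustration only, the "fail an IO that did not complete within say 4 seconds, then let the initiator retry" idea can be sketched in user space as below. This is not LIO, RBD, or ESXi code: the names `submit_with_deadline`, `retry_io`, `IOTimedOut`, and `IO_DEADLINE_SECONDS` are hypothetical, and the 4-second value is just the figure mentioned in the message. Note that the sketch only stops waiting for the slow IO; it does not cancel the in-flight request, which is the harder part that points 2 and 3 above are about.

```python
import concurrent.futures
import time

# Hypothetical deadline, taken from the "say 4 seconds" figure in the message.
IO_DEADLINE_SECONDS = 4.0

# Shared worker pool standing in for the backend IO path (an assumption).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class IOTimedOut(Exception):
    """Raised when a backend IO misses its deadline."""


def submit_with_deadline(io_fn, deadline=IO_DEADLINE_SECONDS):
    """Run io_fn in a worker and report failure if it misses the deadline.

    This only stops *waiting*: the worker keeps running in the background.
    Truly aborting the in-flight request needs support in the lower layers.
    """
    future = _pool.submit(io_fn)
    try:
        return future.result(timeout=deadline)
    except concurrent.futures.TimeoutError:
        raise IOTimedOut(f"IO not complete after {deadline}s") from None


def retry_io(io_fn, deadline=IO_DEADLINE_SECONDS, attempts=3):
    """Initiator-side retry loop, standing in for what ESXi would do."""
    for attempt in range(1, attempts + 1):
        try:
            return submit_with_deadline(io_fn, deadline)
        except IOTimedOut:
            if attempt == attempts:
                raise  # still failing after the last attempt: surface it


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_backend():
        # First call simulates a transient stall; later calls return at once.
        state["calls"] += 1
        if state["calls"] == 1:
            time.sleep(0.3)
        return "ok"

    # Short deadline so the demo finishes quickly.
    print(retry_io(flaky_backend, deadline=0.05))
```

If the stall is transient, the second attempt completes inside the deadline and the caller never sees an error; if every attempt misses the deadline, the failure is raised to the caller instead of hanging indefinitely.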
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html