On Wed, 2015-08-19 at 13:12 -0500, Mike Christie wrote:
> On 08/19/2015 11:16 AM, Alex Gorbachev wrote:
> > What is the difference, and is there willingness to allow LIO to be
> > modified to work with this ESXi behavior? Or should we ask VMware to
> > do something for ESXi to play better with LIO? I cannot fix the
> > code, but would be happy to be the voice of the issue via any
> > available channels.
>
> I think we want to:
>
> 1. Allow LIO to do more than wait for a command during aborts. For
> LIO we will want to add callouts similar to how we can override
> discard/unmap behavior.
>
> 2. In the block layer, add callouts/cmds so that we can abort
> requests/bios at the LLD level.

An API for explicit backend I/O cancellation might be useful. (A
rough, purely hypothetical sketch of what such a callout could look
like is at the end of this mail.)

> 3. For rbd, we will implement support for #2. In ceph we would then
> need to add code to track down commands and kill them if we can, or
> at least figure out what is going on and log a message so we do not
> have these mysterious hung commands.

For the older make_request_fn() based raw block drivers using bios,
the internal I/O timeout handling is completely implementation
dependent.

With modern blk-mq code, I/O timeouts are driven by blk_mq_rq_timer()
walking the hardware queues and completing requests via the
blk_mq_ops->timeout() callback. AFAICT, blk-mq drivers are already
expected to provide this callback to perform internal descriptor
cleanup after a request times out.
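For reference, a minimal sketch of what that callout looks like in an
era-appropriate (v4.1-ish) blk-mq driver. The mydrv_* names and the
cancel helper are placeholders, not a real driver:

#include <linux/blk-mq.h>

/* Placeholder per-request driver state; blk-mq allocates this for
 * every request via blk_mq_tag_set.cmd_size. */
struct mydrv_cmd {
	bool submitted_to_backend;
};

/* Placeholder: try to pull the command back from the backend. Returns
 * true if it was reclaimed and will never complete on its own. */
static bool mydrv_try_cancel(struct mydrv_cmd *cmd)
{
	return !cmd->submitted_to_backend;
}

/*
 * Called from blk-mq timeout handling (blk_mq_rq_timer() walking the
 * hardware queues) once a request has been outstanding longer than
 * the queue's rq_timeout. Internal descriptor cleanup belongs here.
 */
static enum blk_eh_timer_return mydrv_timeout(struct request *rq,
					      bool reserved)
{
	struct mydrv_cmd *cmd = blk_mq_rq_to_pdu(rq);

	if (mydrv_try_cancel(cmd))
		/* Descriptor reclaimed; blk-mq completes the request. */
		return BLK_EH_HANDLED;

	/* Still owned by the backend; re-arm the timer and keep
	 * waiting. */
	return BLK_EH_RESET_TIMER;
}

static struct blk_mq_ops mydrv_mq_ops = {
	/* .queue_rq, .map_queue, etc. omitted for brevity */
	.timeout	= mydrv_timeout,
};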
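And to make the callout idea from #2 concrete, here is the shape such
an interface could take. This is purely hypothetical: nothing like it
exists in the block layer today, and every identifier below is
invented for illustration.

/* HYPOTHETICAL ONLY: no such interface exists; all names invented. */
struct request;

enum abort_status {
	ABORT_CANCELLED,   /* LLD killed it; request will not complete */
	ABORT_IN_FLIGHT,   /* too late to cancel; completion pending   */
	ABORT_UNSUPPORTED, /* LLD cannot cancel this request           */
};

/* Callout an LLD (rbd, for #3) would implement. */
struct lld_abort_ops {
	enum abort_status (*abort_rq)(struct request *rq);
};

/* What a TMF handler (e.g. LIO servicing an ABORT TASK) might do
 * with it instead of just waiting for the command to complete. */
static enum abort_status lld_try_abort(const struct lld_abort_ops *ops,
				       struct request *rq)
{
	if (!ops || !ops->abort_rq)
		return ABORT_UNSUPPORTED;
	return ops->abort_rq(rq);
}

The interesting part would then be the rbd implementation behind
->abort_rq(): tracking down the in-flight OSD op and cancelling it,
or at least logging what it is stuck on, per #3 above.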