Hello Doug,
On 05/05/15 18:10, Doug Ledford wrote:
> Be that as it may, that doesn't change what I said about posting a
> command to a known disconnected QP. You could just fail immediately.
> Something like:
>
>     if (!ch->connected) {
>             scmnd->result = DID_NO_CONNECT;
>             goto err;
>     }
>
> right after getting the channel in queuecommand would work. That would
> save a couple spinlocks, several DMA mappings, a call into the low level
> driver, and a few other things. (And I only left requeue on the table
> because I wasn't sure how the blk_mq dealt with just a single channel
> being down versus all of them being down)

What you wrote above looks correct to me. However, it is intentional
that such a check is not present in srp_queuecommand(). The intention
was to optimize the hot path of that driver as much as possible. Hence
the choice to post a work request on the QP even after it has been
disconnected and to let the HCA generate an error completion.
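
As a rough illustration of the two options, here is a minimal userspace model in C. It is not the ib_srp code: the types and functions (model_channel, model_cmd, queue_checked, queue_posted) are hypothetical and only mimic the control flow discussed above. The one detail borrowed from the kernel is that the SCSI host byte occupies bits 16 to 23 of the result field, so the model shifts the status left by 16. The error completion is modelled synchronously here, whereas a real HCA reports it asynchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real ib_srp types. */
#define MODEL_DID_NO_CONNECT 0x01 /* same numeric value as DID_NO_CONNECT */

struct model_channel {
    bool connected;       /* cleared once the QP is known to be down */
    unsigned int posted;  /* work requests handed to the modelled HCA */
    unsigned int flushed; /* error completions produced by the modelled HCA */
};

struct model_cmd {
    int result;           /* host byte lives in bits 16..23 */
};

/* Variant A: test the connection state first and fail immediately. */
static int queue_checked(struct model_channel *ch, struct model_cmd *cmd)
{
    assert(ch != NULL && cmd != NULL);
    if (!ch->connected) {
        cmd->result = MODEL_DID_NO_CONNECT << 16;
        return -1;        /* nothing was mapped or posted */
    }
    ch->posted++;
    cmd->result = 0;
    return 0;
}

/*
 * Variant B: no extra branch in the hot path. The request is posted
 * unconditionally; on a disconnected channel the modelled HCA answers
 * with an error completion, which is what drives error handling.
 */
static int queue_posted(struct model_channel *ch, struct model_cmd *cmd)
{
    assert(ch != NULL && cmd != NULL);
    ch->posted++;
    cmd->result = 0;
    if (!ch->connected)
        ch->flushed++;    /* stands in for the later error completion */
    return 0;
}
```

With a disconnected channel, queue_checked() returns -1 and leaves posted untouched, while queue_posted() still counts a posted request and relies on the error completion to report the failure. In the connected case, only the first variant pays for the additional branch.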

> But my point in all of this is that if you have a single qp between
> yourself and the target, then any error including a qp resource error ==
> path error since you only have one path. When you have a multi queue
> device, that's no longer true. A transient resource problem on one qp
> does not mean a path event (at least not necessarily, although your
> statement below converts a QP event into a path event by virtue of
> disconnecting and reconnecting all of the QPs). My curiosity is now
> moot given what you wrote about tearing everything down and reconnecting
> (unless the error handling is modified to be more subtle in its
> workings), but the original question in my mind was what happens at the
> blk_mq level if you did have a single queue drop but not all of them and
> you weren't using multipath.

If we want to support this without adding similar code to every SCSI
LLD, I think we first need to change how blk-mq and dm-multipath
interact. Today dm-multipath is a layer on top of blk-mq. Supporting
the above scenario properly is possible, e.g. by integrating multipath
support into the blk-mq layer. I think Hannes and Christoph have
already started to work on this.

>> If only one channel fails all other channels are disconnected and the
>> transport layer error handling mechanism is started.
>
> I missed that. I assume it's done in srp_start_tl_fail_timers()?

Yes, that's correct. Both QP errors and reception of a DREQ trigger a
call to srp_tl_err_work(). That function calls
srp_start_tl_fail_timers(), which starts the reconnection mechanism,
provided that the reconnect_delay parameter has a positive value.
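
The call chain can be sketched as a small userspace model in C. Everything below is hypothetical (model_target, model_handle_event and the other model_* names, the channel count of 4) and only mirrors the sequence described in this thread: either event type leads to the error work, which arms the reconnect only for a positive reconnect_delay, and the reconnect itself drops and re-establishes every channel. The model assumes the reconnect always succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NUM_CH 4 /* hypothetical number of channels */

enum model_event { MODEL_QP_ERROR, MODEL_DREQ_RECEIVED };

struct model_target {
    bool ch_connected[MODEL_NUM_CH];
    int reconnect_delay;  /* seconds; <= 0 means no automatic reconnect */
    bool reconnect_armed; /* set once the reconnect has been scheduled */
};

/* Models srp_start_tl_fail_timers(): arm only for a positive delay. */
static void model_start_tl_fail_timers(struct model_target *t)
{
    if (t->reconnect_delay > 0)
        t->reconnect_armed = true;
}

/* Models srp_tl_err_work(): hands over to the timer logic. */
static void model_tl_err_work(struct model_target *t)
{
    model_start_tl_fail_timers(t);
}

/* A QP error and a received DREQ both take the same path. */
static void model_handle_event(struct model_target *t, size_t ch,
                               enum model_event ev)
{
    assert(t != NULL && ch < MODEL_NUM_CH);
    (void)ev;
    t->ch_connected[ch] = false;
    model_tl_err_work(t);
}

/* Runs when the modelled reconnect timer fires. */
static void model_reconnect(struct model_target *t)
{
    size_t i;

    assert(t != NULL);
    /* Tear down every channel, including those that still worked. */
    for (i = 0; i < MODEL_NUM_CH; i++)
        t->ch_connected[i] = false;
    /* Re-establish all of them; success is assumed in this model. */
    for (i = 0; i < MODEL_NUM_CH; i++)
        t->ch_connected[i] = true;
    t->reconnect_armed = false;
}
```

In this model a failure on a single channel becomes a target-wide event, which is the behaviour that turns a QP event into a path event.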
Bart.