Mike Christie wrote:
> So is the fast_io_fail_tmo callback the terminate_rport_io callback?
Yes. When fast_io_fail_tmo expires, it calls the terminate_rport_io() callback.
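To make that concrete, here is a rough sketch of how an LLD might hook that callback through its fc_function_template - the example_* names are made up for illustration and are not from any real driver:

    #include <scsi/scsi_transport_fc.h>

    /* Invoked by the transport when fast_io_fail_tmo (or dev_loss_tmo) fires.
     * The LLD should abort/complete any i/o it still holds for this rport so
     * the midlayer can retry it, e.g. on another path. */
    static void example_terminate_rport_io(struct fc_rport *rport)
    {
            /* walk the LLD's outstanding i/o for this rport and abort it */
    }

    static struct fc_function_template example_fc_template = {
            /* ... other transport attributes and callbacks ... */
            .terminate_rport_io     = example_terminate_rport_io,
    };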
> If so, are we supposed to unblock the rport/session/target from fc_timeout_fail_rport_io
No... don't unblock.
> and call into the LLD and the LLD will set some bit (or maybe check some rport/session/target/scsi_device bit) so that incoming IO and IO sitting in the driver will be failed with something like DID_BUS_BUSY so it goes to the upper layers?
The way this is managed in the fc transport is: the LLD calls the transport when it establishes connectivity (an "add" call), and when it loses connectivity (a "delete" call). When the transport receives the delete call, it changes the rport state, blocks the rport, and starts the dev_loss timeout (and potentially the fast_io_fail_tmo if < dev_loss). If the LLD makes the add call prior to dev_loss expiring, the transport then updates the state and unblocks the rport. If dev_loss expires, it updates state again (essentially the true deleted state) and tears down the target tree.

To deal with requests being received while blocked, etc - the LLDs use a helper routine (fc_remote_port_chkready()), which validates the rport state, and if not valid (e.g. blocked or removed) returns the appropriate status for the LLD to hand back to the midlayer. If blocked, it returns DID_IMM_RETRY. If deleted, it returns DID_NO_CONNECT.

What the above never dealt with was the i/o already in the driver. The driver always had the option to terminate the active i/o when the loss of connectivity occurred, or it could just wait for it to time out, etc and be killed that way. This patch added the callback at dev_loss_tmo to guarantee i/o is killed, and added the fast_io_fail_tmo if you wanted a faster guarantee. If fast_io_fail_tmo expires and the callback is called - it just kills the outstanding i/o and does nothing to the rport's blocked state.
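For reference, the typical use of that helper in an LLD's queuecommand path looks roughly like the following - example_queuecommand is a made-up name, it is the pattern that matters:

    #include <scsi/scsi_cmnd.h>
    #include <scsi/scsi_device.h>
    #include <scsi/scsi_transport_fc.h>

    static int example_queuecommand(struct scsi_cmnd *cmd,
                                    void (*done)(struct scsi_cmnd *))
    {
            struct fc_rport *rport = starget_to_rport(scsi_target(cmd->device));
            int rval;

            /* Let the transport vet the rport state before touching hardware:
             * blocked -> DID_IMM_RETRY, deleted -> DID_NO_CONNECT. */
            rval = fc_remote_port_chkready(rport);
            if (rval) {
                    cmd->result = rval;
                    done(cmd);
                    return 0;
            }

            /* ... normal submission of the command to the hardware ... */
            return 0;
    }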
> I think I only see the unblock happen on success or fc_starget_delete, so IO in the driver looks like it can get failed upwards, but IO sitting in the queue sits there until fc_rport_final_delete or success.
Yeah - essentially this is correct. I hope the above read that way. I'm also hoping the iSER folks are reading this to get the general feel of what's happening with block, dev_loss, etc.
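Just to illustrate the timing relationship described above - this is a conceptual sketch, not the actual scsi_transport_fc.c code, and the structure/field names are made up:

    #include <linux/jiffies.h>
    #include <linux/workqueue.h>

    struct example_rport {
            int fast_io_fail_tmo;              /* seconds, -1 = disabled */
            unsigned int dev_loss_tmo;         /* seconds */
            struct delayed_work fail_io_work;  /* runs the terminate_rport_io callback */
            struct delayed_work dev_loss_work; /* final teardown of the target tree */
    };

    /* On the "delete" (loss of connectivity) call: always arm dev_loss, and
     * arm the fast-fail timer only when it is configured and shorter. */
    static void example_start_loss_timers(struct example_rport *rp)
    {
            if (rp->fast_io_fail_tmo >= 0 &&
                rp->fast_io_fail_tmo < rp->dev_loss_tmo)
                    schedule_delayed_work(&rp->fail_io_work,
                                          rp->fast_io_fail_tmo * HZ);

            schedule_delayed_work(&rp->dev_loss_work, rp->dev_loss_tmo * HZ);
    }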
> If that is correct, what about a new device state? When the fail fast tmo expires we can set the device to the new state, run the queue, and incoming IO or IO in the request_queue marked with FAILFAST can be failed upwards by scsi-ml. I just woke up though :)
Sounds reasonable. It is adding a new semantic to what was meant by fast_fail - but it's in line with our goal. The goal was to terminate i/o so that it could be quickly rescheduled on a different path rather than wait (what may be a long time) for the dev_loss connectivity timer to fire. Makes sense you would want to make new i/o requests bound by that same window.

-- james s