Re: [PATCH 00/11] First pass at merging Bart's HA work

On 12/05/12 19:50, Bart Van Assche wrote:
On 12/05/12 19:23, Or Gerlitz wrote:
On Fri, Nov 30, 2012 at 4:21 AM, David Dillow <dillowda@xxxxxxxx> wrote:
[...]
Modulo a few style issues (braces around one line if branches, etc.) and
having three state variables vs one, I can live with everything up to
aabfa852acd27962 at git://github.com/bvanassche/linux.git#srp-ha. Those
two are small things that can be fixed later and are not worth holding
things up any further.

I'll try to spend some time on the final four patches tomorrow
afternoon.

Dave, Bart

My colleague Alex Turin <alextu@xxxxxxxxxxxx> today tried the bits as
they appear in Roland's kernel.org tree / for-next branch up to commit
fb57e1dbbd4, and here's some feedback.

Basically, what he did was connect to a target, then take down the
IB port on the initiator side, and issue some I/O (dd if=/dev/sdb
of=/dev/null count=1).

Our reconstruction of events from the logs (below) is the following:

1. The queued command gets completion status 5.

2. As part of error handling, srp_reset_host() is called.

3. srp_reset_host() calls srp_reconnect_target(), which fails because
the port is down.

4. On failure, srp_reconnect_target() calls srp_queue_remove_work(),
which sets target->state to SRP_TARGET_REMOVED.

5. srp_reset_host() is called a second time. It calls
srp_reconnect_target(), but target->state == SRP_TARGET_REMOVED.
srp_reconnect_target() checks whether target->state != SRP_TARGET_LIVE
and returns -EAGAIN.

This probably means that even after the port is enabled again it will
still fail to reconnect?
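
The sequence above can be modelled in a few lines of user-space C. This
is only a sketch of the state transitions as described in this mail, not
the ib_srp code: every model_* name, the port_up flag and the -EIO
return value are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's target state values. */
enum model_target_state { MODEL_TARGET_LIVE, MODEL_TARGET_REMOVED };

struct model_target {
	enum model_target_state state;
	bool port_up;	/* assumption: models the initiator-side IB port */
};

/* Step 4: the failure path marks the target as removed. */
static void model_queue_remove_work(struct model_target *t)
{
	t->state = MODEL_TARGET_REMOVED;
}

/* Steps 3 and 5: reconnect is refused unless the target is still live. */
static int model_reconnect_target(struct model_target *t)
{
	assert(t != NULL);

	if (t->state != MODEL_TARGET_LIVE)
		return -EAGAIN;

	if (!t->port_up) {
		model_queue_remove_work(t);
		return -EIO;	/* placeholder; the real error code is not given here */
	}

	return 0;
}
```

In this model, a first call with the port down fails and moves the
target to the removed state. Every later call then returns -EAGAIN,
even after port_up is set back to true, which is the behaviour the
question above is asking about.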

Hello Or,

The only way to make I/O work reliably if a failure can occur at the
transport layer is to use multipathd on top of ib_srp. If a connection
fails for some reason, then the SRP SCSI host will be removed after the
SCSI error handler has finished with its error recovery strategy. Once
the transport layer is operational again and srp_daemon detects that
the initiator is no longer logged in, srp_daemon will make ib_srp log
in again. multipathd will then cause I/O to continue over the new path.

(replying to my own e-mail)

Another possible approach would be to follow the FC model: block I/O
when a port goes down and unblock it once I/O is possible again. Some
time ago I posted a patch that went somewhat in this direction, in
which ib_srp tried to reconnect to a target repeatedly after a
transport layer failure. That patch can be found here:

http://www.mail-archive.com/linux-rdma@xxxxxxxxxxxxxxx/msg10158.html
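
As a rough illustration of the block/unblock idea, here is a minimal
user-space model in C. It is not taken from the patch linked above: the
model_* names, the port_up flag and the attempt counter are all
hypothetical and exist only to show the intended state transitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_io_state { MODEL_IO_RUNNING, MODEL_IO_BLOCKED };

struct model_host {
	enum model_io_state io_state;
	bool port_up;			/* assumption: models the IB port state */
	unsigned int reconnect_attempts;	/* attempts made while blocked */
};

/* Transport failure: block I/O instead of removing the host. */
static void model_port_down(struct model_host *h)
{
	assert(h != NULL);
	h->port_up = false;
	h->io_state = MODEL_IO_BLOCKED;
}

/*
 * One reconnect attempt, meant to be called repeatedly (for example
 * from a timer). Returns true once I/O is running again.
 */
static bool model_try_reconnect(struct model_host *h)
{
	assert(h != NULL);

	if (h->io_state == MODEL_IO_RUNNING)
		return true;

	h->reconnect_attempts++;
	if (!h->port_up)
		return false;	/* stay blocked; the caller retries later */

	h->io_state = MODEL_IO_RUNNING;
	return true;
}
```

Unlike the remove-on-failure behaviour, a failed attempt here leaves the
host in the blocked state, so a later attempt can still succeed once the
port is back.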

Bart.

