Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

On 04/29/2016 05:47 PM, Laurence Oberman wrote:
> From: "Bart Van Assche" <bart.vanassche@xxxxxxxxxxx>
> To: "Laurence Oberman" <loberman@xxxxxxxxxx>
> Cc: "James Bottomley" <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>, "linux-scsi" <linux-scsi@xxxxxxxxxxxxxxx>, "Mike Snitzer" <snitzer@xxxxxxxxxx>, linux-block@xxxxxxxxxxxxxxx, "device-mapper development" <dm-devel@xxxxxxxxxx>, lsf@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Sent: Friday, April 29, 2016 8:36:22 PM
> Subject: Re: [Lsf] Notes from the four separate IO track sessions at LSF/MM

>> On 04/29/2016 02:47 PM, Laurence Oberman wrote:
>>> Recovery with the 21 LUNs that have in-flight commands to abort
>>> takes 300s.
>>> [ ... ]
>>> eh_deadline is set to 10 on the 2 qlogic ports, eh_timeout is set
>>> to 10 for all devices. In multipath, fast_io_fail_tmo=5.
>>>
>>> I jam one of the target array ports and discard the commands,
>>> effectively black-holing them, and leave it that way until we
>>> recover while I watch the I/O. Recovery takes around 300s even
>>> with all the tuning, and this effectively results in Oracle
>>> cluster evictions.

>> This discussion started as one about the time needed to fail over
>> from one path to another. How long did it take in your test before
>> I/O failed over from the jammed port to another port?
>
> Around 300s before the paths were declared hard failed and the
> devices offlined; this is when I/O restarts.
> The remaining paths on the second Qlogic port (which are not
> jammed) will not be used until the error-handler activity completes.
>
> Until we get messages like the following, and device-mapper starts
> declaring paths down, we are blocked:
> Apr 29 17:20:51 localhost kernel: sd 1:0:1:0: Device offlined - not
> ready after error recovery
> Apr 29 17:20:51 localhost kernel: sd 1:0:1:13: Device offlined - not
> ready after error recovery
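For readers reproducing the quoted test setup: the three timeouts Laurence mentions are normally set through sysfs and multipath.conf. A sketch, assuming host numbers host1/host2 for the two qlogic ports and sdc as an example device (the actual names on a given system will differ):

```shell
# Per-HBA SCSI error-handler deadline, in seconds
# (caps how long the EH escalation may run before host reset)
echo 10 > /sys/class/scsi_host/host1/eh_deadline
echo 10 > /sys/class/scsi_host/host2/eh_deadline

# Per-device error-handler command timeout, in seconds
echo 10 > /sys/block/sdc/device/eh_timeout

# /etc/multipath.conf fragment: fail outstanding I/O this many seconds
# after the transport reports the remote port lost
defaults {
    fast_io_fail_tmo 5
}
```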

Hello Laurence,

Everyone else on all mailing lists to which this message has been posted replies below the message. Please follow this convention.

Regarding the fail-over time: the ib_srp driver guarantees that scsi_done() is invoked from inside its terminate_rport_io() function. Apparently the lpfc and the qla2xxx drivers behave differently. Please work with the maintainers of these drivers to reduce fail-over time.
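The guarantee described above — the transport's terminate-I/O hook completing every outstanding command itself, so the midlayer and dm-multipath can fail the path without waiting for full error-handler escalation — can be sketched as follows. This is a hypothetical, self-contained illustration with stand-in types, not the actual ib_srp code; the struct layout, the `completed` field, and the function name are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct scsi_cmnd; real definition is in
 * <scsi/scsi_cmnd.h>. The 'completed' field is bookkeeping for this
 * sketch only and does not exist in the kernel. */
struct scsi_cmnd {
    int result;                           /* host byte lives in bits 16..23 */
    void (*scsi_done)(struct scsi_cmnd *); /* midlayer completion callback */
    int completed;
};

/* Host-byte code telling the midlayer to fail fast rather than retry;
 * the numeric value here mirrors the kernel's but is illustrative. */
#define DID_TRANSPORT_FAILFAST 0x0f

static void done_cb(struct scsi_cmnd *cmd)
{
    cmd->completed = 1;
}

/*
 * Sketch of the pattern: when the transport layer declares the remote
 * port dead, walk the outstanding commands and complete each one
 * immediately with a fail-fast result, instead of leaving them for the
 * SCSI error handler to abort one by one.
 */
static void terminate_rport_io_sketch(struct scsi_cmnd **outstanding, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        outstanding[i]->result = DID_TRANSPORT_FAILFAST << 16;
        outstanding[i]->scsi_done(outstanding[i]);
    }
}
```

The point of the pattern is that the completion happens synchronously inside the terminate hook, so path-down notification reaches device-mapper in seconds rather than after a minutes-long EH escalation.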

Bart.

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel


