Re: [PATCH 1/1] scsi_dh_alua: properly handling the ALUA transitioning state

Hannes Reinecke <hare@xxxxxxx> · Sat, 21 May 2022 12:17:03 +0200

On 5/20/22 13:03, Martin Wilck wrote:
On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
On 5/20/22 9:03 AM, Martin Wilck wrote:
On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
On 5/20/22 12:57, Martin Wilck wrote:
Brian, Martin,

sorry, I've overlooked this patch previously. I have to say I
think
it's wrong and shouldn't have been applied. At least I need
more
in-
depth explanation.

On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:

The handling of the ALUA transitioning state is currently
broken.
When
a target goes into this state, it is expected that the
target
is
allowed to stay in this state for the implicit transition
timeout
without a path failure.

Can you please show me a quote from the specs on which this
expectation
("without a path failure") is based? AFAIK the SCSI specs don't
say
anything about device-mapper multipath semantics.

The handler has this logic, but it gets
skipped currently.

When the target transitions, there is in-flight I/O from
the
initiator. The first of these responses from the target
will be
a
unit
attention letting the initiator know that the ALUA state
has
changed.
The remaining in-flight I/Os, before the initiator finds
out
that
the
portal state has changed, will return not ready, ALUA state
is
transitioning. The portal state will change to
SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
I/O
immediately failing the path unexpectedly. The path failure
happens
in
less than a second instead of the expected successes until
the
transition timer is exceeded.

dm multipath has no concept of "transitioning" state. Path
state
can be
either active or inactive. As Brian wrote, commands sent to the
transitioning device will return NOT READY, TRANSITIONING, and
require
retries on the SCSI layer. If we know this in advance, why
should
we
continue sending I/O down this semi-broken path? If other,
healthy
paths are available, why it would it not be the right thing to
switch
I/O to them ASAP?

But we do, don't we?
Commands are being returned with the appropriate status, and
dm-multipath should make the corresponding decisions here.
This patch just modifies the check when _sending_ commands; ie
multipath
had decided that the path is still usable.
Question rather would be why multipath did that;

If alua_prep_fn() got called, the path was considered usable at the
given point in time by dm-multipath. Most probably the reason was
simply that no error condition had occured on this path before ALUA
state switched to transitioning. I suppose this can happen
if storage
switches a PG consisting of multiple paths to TRANSITIONING. We get
an
error on one path (sda, say), issue an RTPG, and receive the new
ALUA
state for all paths of the PG. For all paths except sda, we'd just
see
a switch to TRANSITIONING without a previous SCSI error.

With this patch, we'll dispatch I/O (usually an entire bunch) to
these
paths despite seeing them in TRANSITIONING state. Eventually, when
the
SCSI responses are received, this leads to path failures. If I/O
latencies are small, this happens after a few ms. In that case, the
goal of Brian's patch is not reached, because the time until path
failure would still be on the order of milliseconds. OTOH, if
latencies
are high, it takes substantially longer for the kernel to realize
that
the path is non-functional, while other, good paths may be idle. I
fail
to see the benefit.

I'm not sure everyone agrees with you on the meaning of
transitioning.

If we go from non-optimized to optimized or standby to non-
opt/optimized
we don't want to try other paths because it can cause thrashing.

But only with explicit ALUA, or am I missing something? I agree that
the host shouldn't initiate a PG switch if it encounters transitioning
state. I also agree that for transitioning towards a "better" state,
e.g. standby to (non)-optimized, failing the path would be
questionable. Unfortunately we don't know in which "direction" the path
is transitioning - it could be for 'better' or 'worse'. I suppose that
in the case of a PG switch, it can happen that we dispatch I/O to a
device that used to be in Standby and is now transitioning. Would it
make sense to remember the previous state and "guess" what we're going
to transition to? I.e. if the previous state was "Standby", it's
probably going to be (non)optimized after the transition, and vice-
versa?

  We just
need to transition resources before we can fully use the path. It
could
be a local cache operation or for distributed targets it could be a
really
expensive operation.

For both though, it can take longer than the retries we get from
scsi-ml.

So if we want to do "the right thing", we'd continue dispatching to the
device until either the state changes or the device-reported transition
timeout has expired?

Not quite.
I think multipath should make the decision, and we shouldn't try to 
out-guess multipathing from the device driver.
We should only make the information available (to wit: the path state) 
for multipath to make the decision.
But if multipath decides to send I/O via a path which we _think_ it's in 
transitioning we shouldn't be arguing with that (ie we shouldn't preempt 
is from prep_fn) but rather let it rip; who knows, maybe there was a 
reason for that.

The only thing this patch changes (for any current setup) is that we'll 
incur additional round-trip times for I/O being sent in between the 
first I/O returning with 'transitioning' and multipath picking up the 
new path state. This I/O will be returned more-or-less immediately by 
the target, so really we're only having an additional round-trip latency 
for each I/O during that time. This is at most in the milliseconds 
range, so I guess we can live with that.

If, however, multipath fails to pick up the 'transitioning' state then 
this won't work, and we indeed will incur a regression. But that 
arguably is an issue with multipath ...

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman