Re: [PATCH] [PATCH] libmultipath: return 'ghost' state when port is in transition

Brian Bunker <brian@xxxxxxxxxxxxxxx> · Tue, 7 Mar 2023 11:41:54 -0800



> On Mar 7, 2023, at 2:31 AM, Martin Wilck <mwilck@xxxxxxxx> wrote:
> 
> On Mon, 2023-03-06 at 13:04 -0600, Benjamin Marzinski wrote:
>> On Mon, Mar 06, 2023 at 12:46:50PM +0100, Martin Wilck wrote:
>>> Hi Brian,
>>> 
>>> On Sat, 2023-03-04 at 12:49 -0800, Brian Bunker wrote:
>>>>> 
>>>> The checking for standby is 14 years old, and says that TUR
>>>> returns
>>>> a unit attention when the path is in standby. I am not sure why
>>>> that
>>>> wouldn’t be handled by this code above: I would think there
>>>> should be
>>>> just one unit attention for each I_T_L when an ALUA state change
>>>> happens.Not sure how it exceeds the retry count.
>>>> 
>>>> if (key == 0x6) {
>>>>     /* Unit Attention, retry */
>>>>     if (—retry_tur)
>>>>         goto retry;
>>>> }
>>>> 
>>>> From my perspective failing a path for ALUA state standby is
>>>> reasonable
>>>> since it is not an active state.
>>> 
>>> I think the historic rationale for using GHOST is that some storage
>>> arrays, in particular active-passive configurations, may keep
>>> certain
>>> port groups in STANDBY state. If STANDBY was classified as FAILED,
>>> "multipath -ll" would show all paths of such port groups as FAILED,
>>> which would confuse users.
>>> 
>>> That's what I meant before, multipath's GHOST can mean multiple
>>> things
>>> depending on the actual hardware in use, explicit/implicit ALUA,
>>> etc.
>>> Given that today basically every hardware supports ALUA, we could
>>> probably do better. But changing the semantics in the current
>>> situation
>>> is risky and error prone.
>> 
>> I am sympathetic to Martin's view that GHOST is an ambiguous state,
>> and
>> it's not at all clear that in means "temporarily between states". In
>> fact, it ususally doesn't.  On the other hand, if we can be pretty
>> certain that devices won't keep paths in the TRANSITIONING state for
>> an
>> extended time, but we can't be certain what the end state will be, I
>> do
>> see the rationale for not failing them preemtively. 
> 
> This is an important point, for which I don't see a general solution.
> Unfortunately, if a device is TRANSITIONING, the SCSI spec offers no
> means for us to determine what state it's transitioning to, not even
> whether the transition is "up" or "down" in the state hierarchy. We can
> only guess from the previous state, but it will never be more than just
> that, a guess.
If a device is transitioning, it is quite likely that the other paths to this
device might also be transitioning. Any response that fails the path
could lead to the failing of all of the paths. I think the initiator making
any decision about the path state, up or down, for sure is wrong. In
general, if ALUA transitioning is returned, retry the same path up to
the transition timeout.
> 
>> I wonder if PATH_PENDING makes more sense.  We would retain the
>> existing
>> state until the path left the TRANSITIONING state.  The question is,
>> are
>> you trying to make paths that are transitioning out of a failed state
>> come back sooner, or are you trying to keep paths that were in a
>> active
>> state from being prevemtively failed.  Using PATH_PENDING won't fix
>> the
>> first case, only the second.
More the second. In many cases paths to a device are served by
different parts of a system. You might have two controllers performing
an HA function or even two arrays performing an HA function where
each array is made up of controllers performing an HA function. These
relationships will change based on external factors. There could be
connection disruptions in upgrade cases or individual controller
failures, etc. When this happens, an ALUA state change will happen
where the first step of that change could be returning ALUA state
transitioning until a more permanent state is achieved. That could be
all paths are back online, or some are and some are in some offline state
directing initiators to paths which can serve I/O. The initiator can’t
know what has happened by considering a single path.
> 
> A very interesting suggestion. I like it.
> 
> I think that it makes little sense to try and make such paths "come
> back sooner". TRANSITIONING devices aren't usable, and any attempt to
> try to use them will cause an IO error and switch to FAILED state
> immediately by the kernel driver. PATH_PENDING would cause devices that
> are "coming up" to be checked frequently, and thus make them available
> within one checker interval of them becoming actually ACTIVE, which is
> about the best we can do in the "transitioning up" case.
> 
> When the path is going "down" from PATH_UP state, PATH_PENDING would
> imply that the kernel DM_STATE would remain as-is (probably "active").
> If I/O is happening, the device would sooner or later be used by the
> kernel, and the I/O would most probably fail, setting the path to
> FAILED. With PATH_GHOST, the path would get a lower priority and thus
> the likelyhood of it being used would be decreased, at least with
> group_by_prio (although this would mean that this path would be grouped
> together with STANDBY paths, see below).
> 
> Again, I think the behavior with PATH_PENDING would be the best we can
> get. Whether or not the kernel fails the device in the meantime,
> multipathd will issue TUR frequently, and eventually see the device
> arriving in a new state, which will probably be STANDBY or UNAVAILABLE.
> 
>> 
>> PATH_PENDING makes sure that if IO to the path does start failing,
>> the
>> checker won't keep setting the path back to an active state again. 
>> It
>> also avoids the another GHOST issue, where the path would end up
>> being
>> grouped with any actually passive paths, which isn't what we're
>> looking
>> for.
> 
> Good point! This causes pointless re-grouping of paths for group-by-
> prio for every ALUA transition. OTOH, we see such regrouping anyway, in
> particular if the paths don't transition simultaneously (or we don't
> detect the transition simultaneously). The only way to avoid this would
> be a path grouping algorithm that directly uses RTPG reported path
> groups rather than grouping by prio. We don't have such an algorithm
> currently.
Other OS’s do that. Once you have MPIO, both ESX and Windows
use RTPG and not TUR for determining individual path health. Even
this is not perfect. The ALUA state will be correct for the TPG containing
the index of the I_T nexus that sent it, but the ALUA states of the other
TPGs returned is also a best guess. Sending the RTPG down all paths
gives the full picture.
> 
>> The one problem I can think of off the top of my head would be that
>> if
>> the device was held in the TRANSISTIONING state for a long time,
>> multipathd would keep checking it constantly, since PATH_PENDING is
>> really meant for cases where the checker hasn't completed yet, and we
>> just want to keep looking for the result. I suppose it would be
>> possible
>> to add another state that worked just like pending (and could even
>> get
>> converted to PATH_PENDING if there was no other state to be converted
>> to) but didn't cause us to retigger the checker so quickly.  But if
>> devices really will only be in TRANSITIONING for a short time, it
>> might
>> not even be an issue we have to worry about.
This will likely depend on the kernel version. ALUA state transitioning
has been handled differently in different versions of the kernel. Currently
I believe a TUR sent to a path in ALUA state transition will not return to
the caller until it either succeeds, fails with a different sense code, or the
transition timeout has been exceeded. This hasn’t always been the case,
but I think is the right thing to do unless the multipath layer would be the
only place to fail the path and could handle the retries and not the kernel.
> 
> The default transitioning timeout is 60s, and in my experience, even if
> the hardware overrides it, it's rarely more than a few minutes. After
> that, the kernel will set the state to STANDBY.
Yes. The case of a target keeping a path in the transitioning state
Indefinitely is handled by the device handler.
> 
> Unless we're both overlooking something, I agree with you that
> PATH_PENDING is the right thing to do for TRANSITIONING. When a device
> is in transition between states, we _want_ to check it often to make
> sure we notice when the target state is reached.
> 
> We must then be careful not to overload PATH_PENDING with too many
> different meanings. But I don't see this as a big issue currently.
> 
> Regards
> Martin
> 
>> Thoughts?
>> 
>> -Ben
>> 
>>> 
>>> Regards
>>> Martin


--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel