Re: [PATCH] [PATCH] libmultipath: return 'ghost' state when port is in transition

Brian Bunker <brian@xxxxxxxxxxxxxxx> · Thu, 9 Mar 2023 13:40:07 -0800

> On Mar 7, 2023, at 1:15 PM, Martin Wilck <mwilck@xxxxxxxx> wrote:
> 
> Brian,
> 
> On Tue, 2023-03-07 at 11:41 -0800, Brian Bunker wrote:
>> 
>>> On Mar 7, 2023, at 2:31 AM, Martin Wilck <mwilck@xxxxxxxx> wrote:
>>> 
>>> On Mon, 2023-03-06 at 13:04 -0600, Benjamin Marzinski wrote:
>>>> On Mon, Mar 06, 2023 at 12:46:50PM +0100, Martin Wilck wrote:
>>>>> Hi Brian,
>>>>> 
>>>>> On Sat, 2023-03-04 at 12:49 -0800, Brian Bunker wrote:
>>>>>>> 
>>>>>> The checking for standby is 14 years old, and says that TUR
>>>>>> returns
>>>>>> a unit attention when the path is in standby. I am not sure
>>>>>> why
>>>>>> that
>>>>>> wouldn’t be handled by this code above: I would think there
>>>>>> should be
>>>>>> just one unit attention for each I_T_L when an ALUA state
>>>>>> change
>>>>>> happens.Not sure how it exceeds the retry count.
>>>>>> 
>>>>>> if (key == 0x6) {
>>>>>>     /* Unit Attention, retry */
>>>>>>     if (—retry_tur)
>>>>>>         goto retry;
>>>>>> }
>>>>>> 
>>>>>> From my perspective failing a path for ALUA state standby is
>>>>>> reasonable
>>>>>> since it is not an active state.
>>>>> 
>>>>> I think the historic rationale for using GHOST is that some
>>>>> storage
>>>>> arrays, in particular active-passive configurations, may keep
>>>>> certain
>>>>> port groups in STANDBY state. If STANDBY was classified as
>>>>> FAILED,
>>>>> "multipath -ll" would show all paths of such port groups as
>>>>> FAILED,
>>>>> which would confuse users.
>>>>> 
>>>>> That's what I meant before, multipath's GHOST can mean multiple
>>>>> things
>>>>> depending on the actual hardware in use, explicit/implicit
>>>>> ALUA,
>>>>> etc.
>>>>> Given that today basically every hardware supports ALUA, we
>>>>> could
>>>>> probably do better. But changing the semantics in the current
>>>>> situation
>>>>> is risky and error prone.
>>>> 
>>>> I am sympathetic to Martin's view that GHOST is an ambiguous
>>>> state,
>>>> and
>>>> it's not at all clear that in means "temporarily between states".
>>>> In
>>>> fact, it ususally doesn't.  On the other hand, if we can be
>>>> pretty
>>>> certain that devices won't keep paths in the TRANSITIONING state
>>>> for
>>>> an
>>>> extended time, but we can't be certain what the end state will
>>>> be, I
>>>> do
>>>> see the rationale for not failing them preemtively. 
>>> 
>>> This is an important point, for which I don't see a general
>>> solution.
>>> Unfortunately, if a device is TRANSITIONING, the SCSI spec offers
>>> no
>>> means for us to determine what state it's transitioning to, not
>>> even
>>> whether the transition is "up" or "down" in the state hierarchy. We
>>> can
>>> only guess from the previous state, but it will never be more than
>>> just
>>> that, a guess.
>> If a device is transitioning, it is quite likely that the other paths
>> to this
>> device might also be transitioning. Any response that fails the path
>> could lead to the failing of all of the paths. I think the initiator
>> making
>> any decision about the path state, up or down, for sure is wrong. In
>> general, if ALUA transitioning is returned, retry the same path up to
>> the transition timeout.
>>> 
>>>> I wonder if PATH_PENDING makes more sense.  We would retain the
>>>> existing
>>>> state until the path left the TRANSITIONING state.  The question
>>>> is,
>>>> are
>>>> you trying to make paths that are transitioning out of a failed
>>>> state
>>>> come back sooner, or are you trying to keep paths that were in a
>>>> active
>>>> state from being prevemtively failed.  Using PATH_PENDING won't
>>>> fix
>>>> the
>>>> first case, only the second.
>> More the second. In many cases paths to a device are served by
>> different parts of a system. You might have two controllers
>> performing
>> an HA function or even two arrays performing an HA function where
>> each array is made up of controllers performing an HA function. These
>> relationships will change based on external factors. There could be
>> connection disruptions in upgrade cases or individual controller
>> failures, etc. When this happens, an ALUA state change will happen
>> where the first step of that change could be returning ALUA state
>> transitioning until a more permanent state is achieved. That could be
>> all paths are back online, or some are and some are in some offline
>> state
>> directing initiators to paths which can serve I/O. The initiator
>> can’t
>> know what has happened by considering a single path.
> 
> Hm. Could it "know what has happened" by looking at multiple paths?
> I don't think so.
> 
> Are you arguing for or against TRANSITIONING = PATH_PENDING here?
> 
> Again: if we use PATH_PENDING for a previously active path that is now
> transitioning, it will be used for I/O, and be failed. If we use
> PATH_GHOST, it will be regrouped together with STANDBY paths and thus
> will only be used for I/O if all other good paths fail.
> 
> What is the right behavior, in your opinion?
> 
>>> 
>>> A very interesting suggestion. I like it.
>>> 
>>> I think that it makes little sense to try and make such paths "come
>>> back sooner". TRANSITIONING devices aren't usable, and any attempt
>>> to
>>> try to use them will cause an IO error and switch to FAILED state
>>> immediately by the kernel driver. PATH_PENDING would cause devices
>>> that
>>> are "coming up" to be checked frequently, and thus make them
>>> available
>>> within one checker interval of them becoming actually ACTIVE, which
>>> is
>>> about the best we can do in the "transitioning up" case.
>>> 
>>> When the path is going "down" from PATH_UP state, PATH_PENDING
>>> would
>>> imply that the kernel DM_STATE would remain as-is (probably
>>> "active").
>>> If I/O is happening, the device would sooner or later be used by
>>> the
>>> kernel, and the I/O would most probably fail, setting the path to
>>> FAILED. With PATH_GHOST, the path would get a lower priority and
>>> thus
>>> the likelyhood of it being used would be decreased, at least with
>>> group_by_prio (although this would mean that this path would be
>>> grouped
>>> together with STANDBY paths, see below).
>>> 
>>> Again, I think the behavior with PATH_PENDING would be the best we
>>> can
>>> get. Whether or not the kernel fails the device in the meantime,
>>> multipathd will issue TUR frequently, and eventually see the device
>>> arriving in a new state, which will probably be STANDBY or
>>> UNAVAILABLE.
>>> 
>>>> 
>>>> PATH_PENDING makes sure that if IO to the path does start
>>>> failing,
>>>> the
>>>> checker won't keep setting the path back to an active state
>>>> again. 
>>>> It
>>>> also avoids the another GHOST issue, where the path would end up
>>>> being
>>>> grouped with any actually passive paths, which isn't what we're
>>>> looking
>>>> for.
>>> 
>>> Good point! This causes pointless re-grouping of paths for group-
>>> by-
>>> prio for every ALUA transition. OTOH, we see such regrouping
>>> anyway, in
>>> particular if the paths don't transition simultaneously (or we
>>> don't
>>> detect the transition simultaneously). The only way to avoid this
>>> would
>>> be a path grouping algorithm that directly uses RTPG reported path
>>> groups rather than grouping by prio. We don't have such an
>>> algorithm
>>> currently.
>> Other OS’s do that. Once you have MPIO, both ESX and Windows
>> use RTPG and not TUR for determining individual path health.
> 
> We could do that too, but the code hasn't been written so far. It's
> further complicated on Linux by the fact that dm-multipath is based on
> the generic DM layer, which has no concept of ALUA.
> 
> Patches welcome ;-) 
> 
> IMO the first step towards a solution like this would be to set up path
> groups directly from RTPG data rather than indirectly through
> priorities. This would mean that (marginal path logic aside) path
> groups would stay constant, except if they change on the storage side
> (which happens rarely unless I am mistaken). I don't think we can use
> RTPG for path state because we need to track individual path states
> (see below).
> 
>> Even
>> this is not perfect. The ALUA state will be correct for the TPG
>> containing
>> the index of the I_T nexus that sent it, but the ALUA states of the
>> other
>> TPGs returned is also a best guess. Sending the RTPG down all paths
>> gives the full picture.
> 
> I don't think RTPG will ever give you the full picture. It may provide
> insight into how the storage array has organized its ports. But there
> are still cases where a single path may fail, or be shaky, because of
> fabric issues that have nothing to do with the storage array or ALUA.
> And this is also the reason why the initiator can't just rely on ALUA
> determining path state. It can happen that all ports change state
> simultaneously in a state transition. But it can also happen that just
> one path fails because of a broken cable.
> 
>>> 
>>>> The one problem I can think of off the top of my head would be
>>>> that
>>>> if
>>>> the device was held in the TRANSISTIONING state for a long time,
>>>> multipathd would keep checking it constantly, since PATH_PENDING
>>>> is
>>>> really meant for cases where the checker hasn't completed yet,
>>>> and we
>>>> just want to keep looking for the result. I suppose it would be
>>>> possible
>>>> to add another state that worked just like pending (and could
>>>> even
>>>> get
>>>> converted to PATH_PENDING if there was no other state to be
>>>> converted
>>>> to) but didn't cause us to retigger the checker so quickly.  But
>>>> if
>>>> devices really will only be in TRANSITIONING for a short time, it
>>>> might
>>>> not even be an issue we have to worry about.
>> This will likely depend on the kernel version. ALUA state
>> transitioning
>> has been handled differently in different versions of the kernel.
>> Currently
>> I believe a TUR sent to a path in ALUA state transition will not
>> return to
>> the caller until it either succeeds, fails with a different sense
>> code, or the
>> transition timeout has been exceeded. 
>> This hasn’t always been the case,
> 
> IMO this is wrong. alua_check_sense() returns NEEDS_RETRY for
> transitioning, and pass-through commands from multipathd will not be
> retried by the mid layer. And that has been so for a while.
> 
>> but I think is the right thing to do unless the multipath layer would
>> be the
>> only place to fail the path and could handle the retries and not the
>> kernel.
> 
> I don't understand what your are arguing for. Should we use
> PATH_PENDING for ALUA transitioning state, as Ben suggested, or
> PATH_GHOST, like in your original patch?
> 
> Regards
> Martin
Martin,

Sorry I bounce between kernel versions a lot since most of the
problems which find their way to us are released Linux versions
whose kernels are quite a bit older than upstream.I got a chance
to try the proposed solutions. The PATH_GHOST above does what
I am looking for which is not to have the path checker fail the path.

This also works as your earlier comments suggest. This does seem
clearer as to what is happening on the path:

diff --git a/libmultipath/checkers.c b/libmultipath/checkers.c
index fdb91e17..50f0773e 100644
--- a/libmultipath/checkers.c
+++ b/libmultipath/checkers.c
@@ -343,6 +343,7 @@ static const char *generic_msg[CHECKER_GENERIC_MSGTABLE_SIZE] = {
        [CHECKER_MSGID_UP] = " reports path is up",
        [CHECKER_MSGID_DOWN] = " reports path is down",
        [CHECKER_MSGID_GHOST] = " reports path is ghost",
+       [CHECKER_MSGID_PENDING] = " reports path is transitioning",
        [CHECKER_MSGID_UNSUPPORTED] = " doesn't support this device",
 };

diff --git a/libmultipath/checkers.h b/libmultipath/checkers.h
index ea1e8af6..4fbad621 100644
--- a/libmultipath/checkers.h
+++ b/libmultipath/checkers.h
@@ -111,6 +111,7 @@ enum {
        CHECKER_MSGID_UP,
        CHECKER_MSGID_DOWN,
        CHECKER_MSGID_GHOST,
+       CHECKER_MSGID_PENDING,
        CHECKER_MSGID_UNSUPPORTED,
        CHECKER_GENERIC_MSGTABLE_SIZE,
        CHECKER_FIRST_MSGID = 100,      /* lowest msgid for checkers */

diff --git a/libmultipath/checkers/tur.c b/libmultipath/checkers/tur.c
index 551dc4f0..e1742c2b 100644
--- a/libmultipath/checkers/tur.c
+++ b/libmultipath/checkers/tur.c
@@ -179,7 +179,7 @@ retry:
                else if (key == 0x2) {
                        /* Not Ready */
                        /* Note: Other ALUA states are either UP or DOWN */
-                       if( asc == 0x04 && ascq == 0x0b){
+                       if (asc == 0x04 && ascq == 0x0b) {
                                /*
                                 * LOGICAL UNIT NOT ACCESSIBLE,
                                 * TARGET PORT IN STANDBY STATE
@@ -187,6 +187,14 @@ retry:
                                *msgid = CHECKER_MSGID_GHOST;
                                return PATH_GHOST;
                        }
+                       if (asc == 0x04 && ascq == 0x0a) {
+                               /*
+                                * LOGICAL UNIT NOT ACCESSIBLE,
+                                * TARGET PORT IN TRANSITION STATE
+                                */
+                               *msgid = CHECKER_MSGID_PENDING;
+                               return PATH_PENDING;
+                       }
                }
                *msgid = CHECKER_MSGID_DOWN;
                return PATH_DOWN;

This change also keeps the path from being failed by the checker.
It seems to go into and out of transitioning from other states.

Thanks,
Brian 
> 
>>> 
>>> The default transitioning timeout is 60s, and in my experience,
>>> even if
>>> the hardware overrides it, it's rarely more than a few minutes.
>>> After
>>> that, the kernel will set the state to STANDBY.
>> Yes. The case of a target keeping a path in the transitioning state
>> Indefinitely is handled by the device handler.
>>> 
>>> Unless we're both overlooking something, I agree with you that
>>> PATH_PENDING is the right thing to do for TRANSITIONING. When a
>>> device
>>> is in transition between states, we _want_ to check it often to
>>> make
>>> sure we notice when the target state is reached.
>>> 
>>> We must then be careful not to overload PATH_PENDING with too many
>>> different meanings. But I don't see this as a big issue currently.
>>> 
>>> Regards
>>> Martin
>>> 
>>>> Thoughts?
>>>> 
>>>> -Ben
>>>> 
>>>>> 
>>>>> Regards
>>>>> Martin


--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel