Re: [PATCH] multipathd: the sysfs prioritizer can return stale data

Brian Bunker <brian@xxxxxxxxxxxxxxx> · Wed, 31 Jan 2024 14:23:36 -0800



> On Jan 31, 2024, at 12:12 PM, Martin Wilck <mwilck@xxxxxxxx> wrote:
> 
> Hi Brian,
> 
> On Wed, 2024-01-31 at 09:45 -0800, Brian Bunker wrote:
>> 
>> 
>>> On Jan 31, 2024, at 2:19 AM, Martin Wilck <mwilck@xxxxxxxx> wrote:
>>> 
>>> Hi Brian,
>>> 
>>> On Tue, 2024-01-30 at 10:43 -0800, Brian Bunker wrote:
>>>>> 
>>>>> A full rescan shouldn't be necessary. All that's needed is that
>>>>> the
>>>>> kernel issue another RTPG. AFAICS that should happen as soon as
>>>>> the
>>>>> target responds to any command with a sense key of UNIT
>>>>> ATTENTION
>>>>> with
>>>>> ASC=0x2a and ascq=6 or 7 (ALUA state change, ALUA state
>>>>> transition
>>>>> failed).
>>>>> 
>>>>> @Brian, does your storage unit not do this? If so, I suggest we
>>>>> disable
>>>>> the sysfs prioritizer for pure storage.
>>>>> 
>>>>> Otherwise, as far as multipathd is concerned, when a path is
>>>>> reinstated, it should be sufficient to send any IO command to
>>>>> trigger
>>>>> an RTPG. Or am I missing something here?
>>>>> 
>>>>> Martin
>>>> 
>>>> Martin,
>>>> 
>>>> What I am gaining with the rescan is exactly that. You are correct
>>>> that the ALUA device handler in the kernel has to send an RTPG to
>>>> the target.
>>>> 
>>>> We do set a unit attention to have the initiator paths go into
>>>> the
>>>> ANO state before we reboot leading to the path loss, but we do
>>>> not
>>>> set a unit attention when the paths come back up.
>>>> 
>>>> We have relied on the initiator’s polling to pick up the ALUA
>>>> state
>>>> change which they always have in the past and the ‘alua’
>>>> prioritizer
>>>> still will. For us to add a unit attention would work, but there
>>>> are a couple of issues with that.
>>>> 
>>>> 1. Unit attentions may not get back to the initiator. It is not
>>>> guaranteed.
>>> 
>>> That's news for me. If that happens, wouldn't it mean that the
>>> initiator sees a timeout (no response) to some command, IOW that
>>> there's still something very wrong with this I_T nexus?
>> Probably. My point is just that any individual response could be lost
>> wherever and there is no burden on the target to ensure the initiator
>> got the unit attention.
> 
> Right. In the specific case of the ALUA-related UAs that we've been
> discussing, the storage *might* implement a logic to repeat responding
> with UA until an RTPG is received. But that may have unwanted side
> effects; I haven't thought it through.
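
To make the condition discussed above concrete, here is a minimal sketch of the check for the two ALUA-related unit attentions (sense key UNIT ATTENTION, ASC=0x2a, ASCQ 6 or 7). The helper name and the macro names are hypothetical and are not taken from the kernel or multipath-tools sources; only the numeric values come from the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* SCSI sense key value for UNIT ATTENTION. */
#define SKETCH_SENSE_UNIT_ATTENTION   0x06
/* ASC shared by both ALUA-related unit attentions named in the thread. */
#define SKETCH_ASC_ALUA               0x2a
/* ASCQ 6: ALUA state change; ASCQ 7: ALUA state transition failed. */
#define SKETCH_ASCQ_STATE_CHANGED     0x06
#define SKETCH_ASCQ_TRANSITION_FAILED 0x07

/*
 * Hypothetical helper: returns true if the sense data describes one of
 * the unit attentions that should make the initiator issue a new RTPG.
 */
static bool is_alua_unit_attention(uint8_t sense_key, uint8_t asc, uint8_t ascq)
{
    if (sense_key != SKETCH_SENSE_UNIT_ATTENTION || asc != SKETCH_ASC_ALUA)
        return false;
    return ascq == SKETCH_ASCQ_STATE_CHANGED ||
           ascq == SKETCH_ASCQ_TRANSITION_FAILED;
}
```

If the target never returns such a response after the paths come back, nothing in this check fires, which is the situation described in this thread.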
> 
>>> 
>>>> 2. Paths could take a very long time to come back. We might not
>>>> get
>>>> these paths back for a very long time. Sometimes it is just a
>>>> reboot.
>>>> Other times it is a hardware replacement. It is possible for us
>>>> to
>>>> keep this state forever and post when that I_T nexus returns
>>>> but
>>>> we haven’t had to.
>>> 
>>> No offense, that sounds somewhat lazy ;-) Note that it's also kind
>>> of
>>> dangerous. You are hiding the state change from the initiator. If
>>> the
>>> Linux kernel decided to use the access_state device attribute for
>>> anything else but feeding the sysfs attribute, things might go
>>> badly
>>> wrong.
>> That is fair. We definitely could do better here. In general, that
>> unit
>> attention coming out of the preferred state didn’t buy us any speed.
>> Those non-preferred paths weren’t serving I/O so the first I/O that
>> would pick up the unit attention on those paths would be the path
>> checker. The same run of the path checker picked up the new ALUA
>> state. When going into the preferred state, there is read and write
>> I/O which means those unit attentions are picked up very quickly and
>> the ALUA state change is picked up in the kernel before the checker
>> runs again.
>> 
>> Have you ever considered a checker of RTPG as opposed to TUR?
>> That would seemingly solve a lot of this too since you would be getting
>> path state and priority in the same trip.
> 
> Interesting idea. AFAICT, no one has thought about it so far. In the
> past I've invested some thought in tying the checker and prioritizer more
> closely together. Unfortunately, the current multipathd architecture
> treats them as entirely separate, which makes it complicated to achieve
> this.
Yeah. It would be a re-do of the separation between checkers and
prioritizers and might leave you with a mess if you had to keep everything
that wasn’t ALUA functioning the same way. I think it might make check_path
much cleaner though.
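
As a rough illustration of getting path state and priority in the same trip, the sketch below derives both from the first byte of an RTPG target port group descriptor (PREF in bit 7, asymmetric access state in the low four bits). This is not how multipathd works today. The struct, the enum names, the state mapping and the priority numbers are all assumptions made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Asymmetric access state codes as carried in an RTPG descriptor. */
enum sketch_aas {
    SKETCH_AAS_OPTIMIZED     = 0x0,
    SKETCH_AAS_NON_OPTIMIZED = 0x1,
    SKETCH_AAS_STANDBY       = 0x2,
    SKETCH_AAS_UNAVAILABLE   = 0x3,
    SKETCH_AAS_LBA_DEPENDENT = 0x4,
    SKETCH_AAS_OFFLINE       = 0xe,
    SKETCH_AAS_TRANSITIONING = 0xf,
};

/* Hypothetical path states for this sketch only. */
enum sketch_path_state { SKETCH_PATH_DOWN, SKETCH_PATH_GHOST, SKETCH_PATH_UP };

/* Hypothetical combined result: what a checker and a prioritizer would
 * each have produced separately. */
struct sketch_verdict {
    enum sketch_path_state state;
    int prio;        /* illustrative values, not the real prioritizer's */
    bool preferred;  /* PREF bit from the descriptor */
};

static struct sketch_verdict verdict_from_rtpg_byte0(uint8_t byte0)
{
    struct sketch_verdict v = { SKETCH_PATH_DOWN, 0, (byte0 & 0x80) != 0 };

    switch (byte0 & 0x0f) {
    case SKETCH_AAS_OPTIMIZED:
        v.state = SKETCH_PATH_UP;    v.prio = 50; break;
    case SKETCH_AAS_NON_OPTIMIZED:
        v.state = SKETCH_PATH_UP;    v.prio = 10; break;
    case SKETCH_AAS_LBA_DEPENDENT:
        v.state = SKETCH_PATH_UP;    v.prio = 5;  break;
    case SKETCH_AAS_STANDBY:
        v.state = SKETCH_PATH_GHOST; v.prio = 1;  break;
    default:
        /* unavailable, offline, transitioning, or reserved codes */
        break;
    }
    return v;
}
```

A real implementation would still have to issue the RTPG, walk every descriptor, and match the path's relative target port, none of which is shown here.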
>>> 
>>>> If we did post the unit attention, everything works as expected.
>>>> I
>>>> have verified this, but I would also hope that the polling of the
>>>> checkers would also unstick my stale ALUA state and we won’t have
>>>> to.
>>>> 
>>>> I put this rescan_path inline to show the problem and the fix. I
>>>> wasn’t sure the ‘right’ place to put it. I get that it would be
>>>> better not to block on this. It should be possible to put this in
>>>> a
>>>> thread so that it does not. The other caller of rescan_path I
>>>> guess
>>>> is also doing the same thing when it is handling the wwid change.
>>> 
>>> That's true and not optimal, but wwid changes are rare events and
>>> an
>>> error condition in its own right. Patches moving
>>> rescan_path() into a thread would receive sympathetic reviews :-)
>>> The big benefit of the sysfs prioritizer is that it never blocks,
>>> without needing pthreads complexity.
>>> 
>>> Btw, I think you'd need to wait for the RTPG to finish after
>>> triggering
>>> the rescan, if you want to obtain the right priority (alua_rescan()
>>> ->
>>> alua_initialize() -> alua_check_vpd() will only queue an RTPG and
>>> not
>>> wait for the result).
>> For our purposes we didn’t really need to wait for the rescan. As
>> long as
>> it happened. The next time the checker ran it would pick it up. These
>> paths
>> returning for us are redundant paths. We want them back as soon as
>> possible
>> but we have other paths that can serve I/O while waiting for the HA
>> paths.
>> 
>> I can create a patch in the sysfs prioritizer to do the rescan_path
>> in a thread
>> that the checker and priority run doesn’t wait on. Would that be well
>> received
>> or am I better served by either posting a unit attention or just
>> using detect_prio
>> set to no and leaving the ‘sysfs’ prioritizer alone?
> 
> I am not sure. Like I said, the beauty of the sysfs prioritizer is that
> it doesn't do any IO. Rescanning a device from this prioritizer
> basically voids this benefit. You might as well just use alua.
> As I said, we do it for RDAC as well.
> 
> But if you want to invest more effort, feel free to submit patches ;-)
> You can do this any time on top of the hwtable change.
OK I will submit the hwtable change first since there is no controversy
there at all.

My change only affects the reinstate path so it isn’t like
‘alua’ where priority is evaluated with RTPG at each checker instance.
The rescan should be rare since I_T nexus loss shouldn’t be too common.
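
For contrast, this is roughly what reading the kernel's cached state looks like. It is a minimal sketch, not the actual sysfs prioritizer code: the helper names are hypothetical, and the /sys/block/<dev>/device/access_state location is an assumption about where the attribute is visible for a given device. Nothing here sends a command to the target, so the value is only as fresh as the kernel's last RTPG.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read the first line of a sysfs attribute into buf
 * and strip the trailing newline. Returns 0 on success, -1 on failure.
 */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/*
 * Hypothetical helper: fetch the cached ALUA access state for a block
 * device name such as "sdb". The path layout is an assumption.
 */
static int read_access_state(const char *dev, char *buf, size_t len)
{
    char path[256];
    int n = snprintf(path, sizeof(path),
                     "/sys/block/%s/device/access_state", dev);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    return read_sysfs_line(path, buf, len);
}
```

Calling read_access_state("sdb", state, sizeof(state)) right after a path is reinstated can still return the pre-outage string, which is the stale data this patch is about.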
> 
>>> Unfortunately, the kernel has no API for manually triggering an
>>> update
>>> of the access_state. I believe that would be useful elsewhere, too.
>>> We
>>> can consider adding it, but it won't help with current kernels.
>>> 
>>> IMO the best option for your storage arrays would be to force using
>>> the
>>> alua prioritizer rather than the sysfs one. You are not alone,
>>> we're
>>> doing this for RDAC already (see check_rdac() call in
>>> detect_prio()).
>>> This can be configured in multipath.conf right now by setting
>>> "detect_prio no" and "prio alua", and we can make it the default
>>> for
>>> your storage with a trivial patch for hwtable.c.
>> This is what we are doing now in our recommended configuration. I
>> will
>> probably add a patch for our hw table entry soon. It is a bit strange
>> to
>> me still that detect_prio would mean replace the one that I am
>> explicitly
>> stating in the device section. To me detect_prio would be if I didn’t
>> provide one and wanted multipath to choose for me.
> 
> True. The design of the "detect_prio" option is awkward and confusing.
> It happens all the time that people set "prio" and wonder why the
> setting isn't applied. Or worse, they think it is applied but it's not.
> I think the original idea was to use ALUA whenever supported, even if
> historically the hwtable contained something else for a given storage.
> It has been this way for a long time, I don't think we can change the
> semantics easily.
I saw Ben’s later comment explaining the history of how it came to be
this way.
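
For anyone following along, the interim setting discussed above ("detect_prio no" plus "prio alua") would look roughly like this in multipath.conf. The vendor and product strings are assumptions for illustration; use whatever your array actually reports.

```
devices {
    device {
        # vendor/product values are assumed, not taken from this thread
        vendor "PURE"
        product "FlashArray"
        # do not let multipath swap in the sysfs prioritizer
        detect_prio "no"
        # always evaluate priority with RTPG
        prio "alua"
    }
}
```

Without detect_prio set to no, the prio line would be overridden whenever ALUA support is detected, which is the confusing behavior described above.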
> 
> Regards
> Martin
Thanks,
Brian
> 
>>> 
>>> Regards
>>> Martin
>>> 
>> Thanks,
>> Brian