Re: [PATCH 1/7] libmultipath: Add max_retries config option

On Thu, Nov 09, 2023 at 09:07:23AM +0000, Martin Wilck wrote:
> Hi Ben,
> 
> On Wed, 2023-11-08 at 17:08 -0500, Benjamin Marzinski wrote:
> > On Wed, Nov 08, 2023 at 03:36:14PM +0000, Martin Wilck wrote:
> > > On Thu, 2023-11-02 at 18:15 -0400, Benjamin Marzinski wrote:
> > > > This option lets multipath set a scsi disk's max_retries sysfs value.
> > > > Setting this can be helpful for cases where the path checker succeeds,
> > > > but IO commands hang and timeout. By default, the SCSI layer will retry
> > > > IOs 5 times. Reducing this value will allow multipath to retry the IO
> > > > down another path sooner.
> > > > 
> > > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > > 
> > > 2 nitpicks below. Please explain to me again why we recommend to
> > > activate shaky paths detection with this. What will go wrong if the
> > > user uses max_retries without shaky path detection?
> > 
> 
> I've thought about this some more after my review. Similar to the auto-
> resize, do we need to make this parameter a hardware property? AFAICS
> it would be sufficient to have it as setting in the "defaults" section,
> which would make the patch much simpler.

Sure.

> 
> > In the case where the path_checker keeps succeeding, but the IOs keep
> > hanging, multipathd will just keep restoring this path over and over
> > again. That's the sort of path ping-ponging that shaky path detection
> > should be able to stop. I guess this can speed up other cases for
> > failover as well, so I can leave that off.  Hopefully people know that
> > if they are seeing ping-ponging, shaky path detection can help with it.
> 
> This is a general argument for enabling shaky paths detection. But I
> don't see how it relates to max_retries. Decreasing max_retries should 
> make it less likely that regular IOs are hanging while the path checker
> succeeds, whether or not shaky path detection is enabled. By decreasing
> max_retries, we force the kernel to treat regular IO more like
> passthrough IO. AFAIU that should decrease the difference of failure
> probability between the path checker and regular IO.

I wasn't being clear. Changing this won't make ping-ponging more likely.
It's just that if you are in the case where IO is hanging for extended
periods of time and then failing, but the path checker is succeeding,
and you want an efficiently running system, changing max_retries will
only get you halfway there, since it won't fix the ping-ponging. I agree
that the ping-ponging is a separate issue. My brain was sort of stuck in
the specific customer issue that drove all this work, which did involve
ping-ponging.

> I have another question; as pointed out in my previous post about this
> patch, max_retries only affects the kernel's "maybe_retry" case, IOW
> mostly DID_TRANSPORT_DISRUPTED. This is a condition that can happen
> with shaky paths. But SCSI command timeouts are also likely, and for
> that case, reducing max_retries isn't going to help, as timed out
> commands won't be retried but passed to the error handler.
> DID_TRANSPORT_DISRUPTED errors will happen quickly most of the time.
> IOW, I don't quite understand how decreasing max_retries substantially
> decreases the time regular I/O would be hanging [1]. I associate
> hanging IO mostly with command timeouts. Am I missing something here?

This work is all in response to a customer, who found that the only way
to work around what turned out in the end to be an HBA issue was to
lower max_retries and turn on shaky path detection, to make multipath
quickly ignore unusable paths. I admit that I didn't dig into why
reducing max_retries sped things up. They just asked if it was possible
to make multipath control this scsi config parameter like it does
others, and it seemed like a reasonable request to me.

-Ben
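
For readers who have not touched this attribute by hand, here is a rough
sketch of what "setting a scsi disk's max_retries sysfs value" amounts
to. It is not the libmultipath code from the patch. The function name,
the sysfs_root parameter (there only so the sketch can be tried against
a scratch directory), the 0..5 range check, and the 1:0:0:0 address are
all assumptions made for illustration; the attribute is assumed to live
under the scsi_disk class directory for the device's H:C:T:L address.

```python
from pathlib import Path


def set_max_retries(hctl: str, retries: int, sysfs_root: str = "/sys") -> int:
    """Write `retries` to a SCSI disk's max_retries attribute.

    `hctl` is the host:channel:target:lun address, e.g. "1:0:0:0"
    (hypothetical). Returns the value that was set before the write.
    """
    # Assumption: only values between 0 and the default of 5 are
    # accepted here. The kernel's own limits may differ.
    if not 0 <= retries <= 5:
        raise ValueError(f"max_retries out of range: {retries}")

    # Assumed attribute location: <sysfs>/class/scsi_disk/<H:C:T:L>/max_retries
    attr = Path(sysfs_root) / "class" / "scsi_disk" / hctl / "max_retries"

    old = int(attr.read_text().strip())
    if old != retries:
        # Skip the write when the value is already what we want.
        attr.write_text(f"{retries}\n")
    return old
```

Calling set_max_retries("1:0:0:0", 2) as root would lower the retry
count for that one disk. As discussed above, this only shortens the
retry loop; it does nothing about path ping-ponging.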

> Regards
> Martin
> 
> [1] Note that SAM_STAT_BUSY is treated by the kernel with
> ADD_TO_MLQUEUE, which means max_retries is ignored.




