Re: [PATCH 1/7] libmultipath: Add max_retries config option

Martin Wilck <martin.wilck@xxxxxxxx> · Thu, 9 Nov 2023 09:07:23 +0000

Hi Ben,

On Wed, 2023-11-08 at 17:08 -0500, Benjamin Marzinski wrote:
> On Wed, Nov 08, 2023 at 03:36:14PM +0000, Martin Wilck wrote:
> > On Thu, 2023-11-02 at 18:15 -0400, Benjamin Marzinski wrote:
> > > This option lets multipath set a scsi disk's max_retries sysfs value.
> > > Setting this can be helpful for cases where the path checker succeeds,
> > > but IO commands hang and timeout. By default, the SCSI layer will retry
> > > IOs 5 times. Reducing this value will allow multipath to retry the IO
> > > down another path sooner.
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> > 
> > 2 nitpicks below. Please explain to me again why we recommend to
> > activate shaky paths detection with this. What will go wrong if the
> > user uses max_retries without shaky path detection?
> 

I've thought about this some more since my review. As with auto-resize,
does this parameter need to be a hardware property? AFAICS it would be
sufficient to have it as a setting in the "defaults" section, which
would make the patch much simpler.

> In the case where the path_checker keeps succeeding, but the IOs keep
> hanging, multipathd will just keep restoring this path over and over
> again. That's the sort of path ping-ponging that shaky path detection
> should be able to stop. I guess this can speed up other cases for
> failover as well, so I can leave that off.  Hopefully people know that
> if they are seeing ping-ponging, shaky path detection can help with
> it.

This is a general argument for enabling shaky paths detection. But I
don't see how it relates to max_retries. Decreasing max_retries should 
make it less likely that regular IOs are hanging while the path checker
succeeds, whether or not shaky path detection is enabled. By decreasing
max_retries, we force the kernel to treat regular IO more like
passthrough IO. AFAIU that should reduce the difference in failure
probability between the path checker and regular IO.

I have another question; as pointed out in my previous post about this
patch, max_retries only affects the kernel's "maybe_retry" case, IOW
mostly DID_TRANSPORT_DISRUPTED. This is a condition that can happen
with shaky paths. But SCSI command timeouts are also likely, and for
that case, reducing max_retries isn't going to help, as timed out
commands won't be retried but passed to the error handler.
DID_TRANSPORT_DISRUPTED errors will happen quickly most of the time.
IOW, I don't quite understand how decreasing max_retries substantially
decreases the time regular I/O would be hanging [1]. I associate
hanging IO mostly with command timeouts. Am I missing something here?

Regards
Martin

[1] Note that SAM_STAT_BUSY is treated by the kernel with
ADD_TO_MLQUEUE, which means max_retries is ignored.
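
To make the three cases above concrete, here is a toy model in Python of
the behaviour as described in this mail. It is not kernel code: the
names Outcome, classify() and attempts_until_fail_up() are made up for
illustration, and the retry accounting (a command is retried while fewer
than max_retries retries have been done) is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    RETRY = "retry on the same path"
    FAIL_UP = "return the error to the upper layer (multipath)"
    REQUEUE = "requeue (ADD_TO_MLQUEUE), max_retries not consulted"
    ERROR_HANDLER = "hand the command to the SCSI error handler"


def classify(result: str, retries_done: int, max_retries: int) -> Outcome:
    """Toy classification of a completed or timed-out command.

    Only the three cases discussed in the mail are modelled.
    """
    if result == "SAM_STAT_BUSY":
        # Footnote [1]: requeued, max_retries is ignored.
        return Outcome.REQUEUE
    if result == "TIMEOUT":
        # As argued in the mail: timeouts go to the error handler,
        # so lowering max_retries does not shorten this case.
        return Outcome.ERROR_HANDLER
    if result == "DID_TRANSPORT_DISRUPTED":
        # The "maybe_retry" case: the only one bounded by max_retries here.
        if retries_done < max_retries:
            return Outcome.RETRY
        return Outcome.FAIL_UP
    raise ValueError(f"result not modelled: {result}")


def attempts_until_fail_up(max_retries: int) -> int:
    """Count submissions of a command that always hits DID_TRANSPORT_DISRUPTED."""
    attempts = 1
    retries_done = 0
    while classify("DID_TRANSPORT_DISRUPTED", retries_done, max_retries) is Outcome.RETRY:
        retries_done += 1
        attempts += 1
    return attempts
```

In this model, lowering max_retries only reduces the number of attempts
in the DID_TRANSPORT_DISRUPTED branch; the timeout and busy branches
return the same outcome whatever value is passed.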