Re: [PATCH 1/7] libmultipath: Add max_retries config option

Martin Wilck <martin.wilck@xxxxxxxx> · Thu, 9 Nov 2023 19:46:49 +0000

On Thu, 2023-11-09 at 12:26 -0500, Benjamin Marzinski wrote:
> On Thu, Nov 09, 2023 at 09:07:23AM +0000, Martin Wilck wrote:
> 
> > > In the case where the path_checker keeps succeeding, but the IOs keep
> > > hanging, multipathd will just keep restoring this path over and over
> > > again. That's the sort of path ping-ponging that shaky path detection
> > > should be able to stop. I guess this can speed up other cases for
> > > failover as well, so I can leave that off.  Hopefully people know that
> > > if they are seeing ping-ponging, shaky path detection can help with
> > > it.
> > 
> > This is a general argument for enabling shaky paths detection. But I
> > don't see how it relates to max_retries. Decreasing max_retries should
> > make it less likely that regular IOs are hanging while the path checker
> > succeeds, whether or not shaky path detection is enabled. By decreasing
> > max_retries, we force the kernel to treat regular IO more like
> > passthrough IO. AFAIU that should decrease the difference of failure
> > probability between the path checker and regular IO.
> 
> I wasn't being clear. Changing this won't make ping-ponging more likely.
> It's just that if you are in the case where IO is hanging for extended
> periods of time and then failing, but the path checker is succeeding,
> and you want an efficiently running system, changing max_retries will
> only get you halfway there, since it won't fix the ping-ponging. I agree
> that the ping-ponging is a separate issue. My brain was sort of stuck in
> the specific customer issue that drove all this work, which did involve
> ping-ponging.
> 
> > I have another question; as pointed out in my previous post about this
> > patch, max_retries only affects the kernel's "maybe_retry" case, IOW
> > mostly DID_TRANSPORT_DISRUPTED. This is a condition that can happen
> > with shaky paths. But SCSI command timeouts are also likely, and for
> > that case, reducing max_retries isn't going to help, as timed out
> > commands won't be retried but passed to the error handler.
> > DID_TRANSPORT_DISRUPTED errors will happen quickly most of the time.
> > IOW, I don't quite understand how decreasing max_retries substantially
> > decreases the time regular I/O would be hanging [1]. I associate
> > hanging IO mostly with command timeouts. Am I missing something here?
> 
> This work is all in response to a customer, who found that the only way
> to work around what turned out in the end to be an HBA issue, was to
> lower max_retries and turn on shaky path detection, to make multipath
> quickly ignore unusable paths. I admit that I didn't dig into why
> reducing max_retries sped things up. They just asked if it was possible
> to make multipath control this scsi config parameter like it does
> others, and it seemed like a reasonable request to me.

Right. I didn't want to say it's not. I was just hoping that we could
achieve a better understanding of the underlying issue.

Martin