On Thu, 2023-11-09 at 12:26 -0500, Benjamin Marzinski wrote:
> On Thu, Nov 09, 2023 at 09:07:23AM +0000, Martin Wilck wrote:
> > > In the case where the path_checker keeps succeeding, but the IOs
> > > keep hanging, multipathd will just keep restoring this path over
> > > and over again. That's the sort of path ping-ponging that shaky
> > > path detection should be able to stop. I guess this can speed up
> > > other cases for failover as well, so I can leave that off.
> > > Hopefully people know that if they are seeing ping-ponging, shaky
> > > path detection can help with it.
> >
> > This is a general argument for enabling shaky paths detection. But I
> > don't see how it relates to max_retries. Decreasing max_retries
> > should make it less likely that regular IOs are hanging while the
> > path checker succeeds, whether or not shaky path detection is
> > enabled. By decreasing max_retries, we force the kernel to treat
> > regular IO more like passthrough IO. AFAIU that should decrease the
> > difference of failure probability between the path checker and
> > regular IO.
>
> I wasn't being clear. Changing this won't make ping-ponging more
> likely. It's just that if you are in the case where IO is hanging for
> extended periods of time and then failing, but the path checker is
> succeeding, and you want an efficiently running system, changing
> max_retries will only get you halfway there, since it won't fix the
> ping-ponging. I agree that the ping-ponging is a separate issue. My
> brain was sort of stuck in the specific customer issue that drove all
> this work, which did involve ping-ponging.
>
> > I have another question; as pointed out in my previous post about
> > this patch, max_retries only affects the kernel's "maybe_retry"
> > case, IOW mostly DID_TRANSPORT_DISRUPTED. This is a condition that
> > can happen with shaky paths. But SCSI command timeouts are also
> > likely, and for that case, reducing max_retries isn't going to help,
> > as timed out commands won't be retried but passed to the error
> > handler. DID_TRANSPORT_DISRUPTED errors will happen quickly most of
> > the time. IOW, I don't quite understand how decreasing max_retries
> > substantially decreases the time regular I/O would be hanging [1]. I
> > associate hanging IO mostly with command timeouts. Am I missing
> > something here?
>
> This work is all in response to a customer, who found that the only
> way to work around what turned out in the end to be an HBA issue, was
> to lower max_retries and turn on shaky path detection, to make
> multipath quickly ignore unusable paths. I admit that I didn't dig
> into why reducing max_retries sped things up. They just asked if it
> was possible to make multipath control this scsi config parameter like
> it does others, and it seemed like a reasonable request to me.

Right. I didn't want to say it's not. I was just hoping that we could
achieve a better understanding of the underlying issue.

Martin