Hi Ben,

On Wed, 2023-11-08 at 17:08 -0500, Benjamin Marzinski wrote:
> On Wed, Nov 08, 2023 at 03:36:14PM +0000, Martin Wilck wrote:
> > On Thu, 2023-11-02 at 18:15 -0400, Benjamin Marzinski wrote:
> > > This option lets multipath set a scsi disk's max_retries sysfs
> > > value. Setting this can be helpful for cases where the path
> > > checker succeeds, but IO commands hang and timeout. By default,
> > > the SCSI layer will retry IOs 5 times. Reducing this value will
> > > allow multipath to retry the IO down another path sooner.
> > >
> > > Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> >
> > 2 nitpicks below. Please explain to me again why we recommend to
> > activate shaky paths detection with this. What will go wrong if the
> > user uses max_retries without shaky path detection?
>

I've thought about this some more after my review. Similar to the
auto-resize option, do we need to make this parameter a hardware
property? AFAICS it would be sufficient to have it as a setting in the
"defaults" section, which would make the patch much simpler.

> In the case were the path_checker keeps succeeding, but the IOs keep
> hanging, multipathd will just keep restoring this path over and over
> again. That's the sort of path ping-ponging that shaky path detection
> should be able to stop. I guess this is can speed up other cases for
> failover as well, so I can leave that off. Hopefully people know that
> if they are seeing ping-ponging, shaky path detection can help with
> it.

This is a general argument for enabling shaky paths detection, but I
don't see how it relates to max_retries. Decreasing max_retries should
make it less likely that regular IOs hang while the path checker
succeeds, whether or not shaky path detection is enabled. By decreasing
max_retries, we force the kernel to treat regular IO more like
passthrough IO. AFAIU that should reduce the difference in failure
probability between the path checker and regular IO.

I have another question; as pointed out in my previous post about this
patch, max_retries only affects the kernel's "maybe_retry" case, IOW
mostly DID_TRANSPORT_DISRUPTED. This is a condition that can happen
with shaky paths. But SCSI command timeouts are also likely, and for
that case, reducing max_retries isn't going to help, as timed out
commands won't be retried but passed to the error handler.
DID_TRANSPORT_DISRUPTED errors will happen quickly most of the time.
IOW, I don't quite understand how decreasing max_retries substantially
decreases the time regular I/O would be hanging [1]. I associate
hanging IO mostly with command timeouts.

Am I missing something here?

Regards
Martin

[1] Note that SAM_STAT_BUSY is treated by the kernel with
ADD_TO_MLQUEUE, which means max_retries is ignored.
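
To make the question concrete, here is a toy model of the three failure
classes as described in this mail. It is only a sketch of my
understanding, not kernel code: the names Outcome, retry_budget and
is_affected_by_max_retries are made up for illustration, and the
handling of timeouts encodes the assumption stated above (timed out
commands go to the error handler rather than being retried).

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Failure classes named in this mail (labels only, not kernel types)."""
    TRANSPORT_DISRUPTED = "DID_TRANSPORT_DISRUPTED"
    TIMEOUT = "command timeout"
    BUSY = "SAM_STAT_BUSY"


def retry_budget(outcome: Outcome, max_retries: int) -> Optional[int]:
    """Return how many resubmissions max_retries permits for this outcome.

    None means the command is requeued without consulting max_retries.
    Hypothetical helper modelling the description in this mail.
    """
    if outcome is Outcome.TRANSPORT_DISRUPTED:
        # "maybe_retry" case: the only one bounded by max_retries.
        return max_retries
    if outcome is Outcome.TIMEOUT:
        # Assumption from the text: passed to the error handler, not retried.
        return 0
    # SAM_STAT_BUSY: requeued (ADD_TO_MLQUEUE), max_retries is ignored.
    return None


def is_affected_by_max_retries(outcome: Outcome) -> bool:
    """True if lowering max_retries changes the retry budget in this model."""
    return retry_budget(outcome, 5) != retry_budget(outcome, 1)
```

In this model, is_affected_by_max_retries() is true only for
Outcome.TRANSPORT_DISRUPTED, which is exactly why I wonder how much the
new option helps with IO that hangs because of command timeouts.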