Re: [PATCH] multipathd: Make sure to disable queueing if recovery has failed.

How long does recovery take? I am unclear on when the explicitly
set queue_if_no_path is being overridden, and why disabling it is
useful.

Typically we set queue_if_no_path to queue forever, with the intent
that the device will ride out longer storage and/or disk issues without
returning an error to the application and/or corrupting filesystems,
provided the storage issue can be fixed.

We generally expect it to recover when the storage comes back, but
the storage could be experiencing significant issues for a
significant period of time (>10 minutes). Since the storage has to be
fixed to get things working again anyway, returning an error has a lot
of negative value: it forces manual recovery steps (fsck, loss of
data).

We also manually disable queueing if we need to remove the mpath
devices (the paths are already gone, as they were non-responsive for
> 24 hours and removed via tmo_timeout), and/or forcibly reboot the
nodes when we determine the storage is not coming back.


On Mon, Nov 27, 2023 at 3:54 PM Benjamin Marzinski <bmarzins@xxxxxxxxxx> wrote:
>
> If a multipath device has no_path_retry set to a number and has lost all
> paths, gone into recovery mode, and timed out, it will disable
> queue_if_no_path. After that, if one of those failed paths is removed,
> when the device is reloaded, queue_if_no_path will be re-enabled.  When
> set_no_path_retry() is then called to update the queueing state, it will
> not disable queue_if_no_path, since the device is still in the recovery
> state, so it believes no work needs to be done. The device will remain
> in the recovery state, with retry_tick at 0, and queueing enabled,
> even though there are no usable paths.
>
> To fix this, in set_no_path_retry(), if no_path_retry is set to a number
> and the device is queueing but it is in recovery mode and out of
> retries with no usable paths, manually disable queue_if_no_path.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@xxxxxxxxxx>
> ---
>  libmultipath/structs_vec.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
> index 0e8a46e7..3cb23c73 100644
> --- a/libmultipath/structs_vec.c
> +++ b/libmultipath/structs_vec.c
> @@ -627,8 +627,18 @@ void set_no_path_retry(struct multipath *mpp)
>                             !mpp->in_recovery)
>                                 dm_queue_if_no_path(mpp->alias, 1);
>                         leave_recovery_mode(mpp);
> -               } else if (pathcount(mpp, PATH_PENDING) == 0)
> +               } else if (pathcount(mpp, PATH_PENDING) == 0) {
> +                       /*
> +                        * If in_recovery is set, enter_recovery_mode does
> +                        * nothing. If the device is already in recovery
> +                        * mode and has already timed out, manually call
> +                        * dm_queue_if_no_path to stop it from queueing.
> +                        */
> +                       if ((!mpp->features || is_queueing) &&
> +                           mpp->in_recovery && mpp->retry_tick == 0)
> +                               dm_queue_if_no_path(mpp->alias, 0);
>                         enter_recovery_mode(mpp);
> +               }
>                 break;
>         }
>  }
> --
> 2.41.0
>
>
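
To check my reading of the scenario in the commit message, here is a
small standalone model of the numeric no_path_retry branch. It is only
a sketch and not the libmultipath code: struct mp_model, its fields,
the model_* helpers and the with_fix switch are names I made up for
illustration, the kernel's queue_if_no_path state is reduced to a
single bool, and the retry countdown is simplified.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct multipath fields involved. */
struct mp_model {
	int  no_path_retry;   /* numeric setting, > 0 in this model */
	int  retry_tick;      /* 0 while in recovery means "timed out" */
	bool in_recovery;
	bool kernel_queueing; /* models queue_if_no_path in the dm table */
	int  usable_paths;
	int  pending_paths;
};

/* Simplified: like the real helper, does nothing if already in recovery. */
static void model_enter_recovery(struct mp_model *m)
{
	if (m->in_recovery)
		return;
	m->in_recovery = true;
	m->retry_tick = m->no_path_retry; /* assumed countdown start */
}

/* Simplified: leaving recovery turns queueing back on. */
static void model_leave_recovery(struct mp_model *m)
{
	if (m->in_recovery)
		m->kernel_queueing = true;
	m->in_recovery = false;
	m->retry_tick = 0;
}

/* Models only the "no_path_retry is a number" case of set_no_path_retry(). */
static void model_set_no_path_retry(struct mp_model *m, bool with_fix)
{
	assert(m->no_path_retry > 0);

	if (m->usable_paths > 0) {
		if (!m->kernel_queueing && !m->in_recovery)
			m->kernel_queueing = true;
		model_leave_recovery(m);
	} else if (m->pending_paths == 0) {
		/* The patch: already in recovery and out of retries, so
		 * stop queueing by hand before the no-op enter call. */
		if (with_fix && m->kernel_queueing &&
		    m->in_recovery && m->retry_tick == 0)
			m->kernel_queueing = false;
		model_enter_recovery(m);
	}
}

/* Replays the commit message: timed out, then reloaded, then updated. */
bool queueing_after_reload(bool with_fix)
{
	struct mp_model m = {
		.no_path_retry = 5,
		.retry_tick = 0,
		.in_recovery = true,
		.kernel_queueing = false, /* disabled when recovery timed out */
		.usable_paths = 0,
		.pending_paths = 0,
	};

	m.kernel_queueing = true;     /* reload re-enables queueing */
	model_set_no_path_retry(&m, with_fix);
	return m.kernel_queueing;
}
```

In this model, queueing_after_reload(false) leaves queueing enabled
with no usable paths, which is the stuck state described above, and
queueing_after_reload(true) leaves it disabled. Nothing here touches
the case where no_path_retry is set to queue, which is the
configuration I am asking about.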