Re: [PATCH] md/raid5: release batch_last before waiting for another stripe_head

John Pittman <jpittman@xxxxxxxxxx> · Mon, 2 Oct 2023 15:21:59 -0400

Thanks a lot, David.  Song, as a note: David's patch was tested by a
Red Hat customer, and it did resolve the deadlock they were hitting.
CC'ing Laurence Oberman, who assisted on that case.

On Mon, Oct 2, 2023 at 2:39 PM David Jeffery <djeffery@xxxxxxxxxx> wrote:
>
> When raid5_get_active_stripe is called with a ctx containing a stripe_head in
> its batch_last pointer, it can cause a deadlock if the task sleeps waiting on
> another stripe_head to become available. The stripe_head held by batch_last
> can block other stripe_heads from advancing, so no stripe_heads are ever
> released and raid5_get_active_stripe waits forever.
>
> As with the quiesce state handling earlier in the function, batch_last
> must be released by raid5_get_active_stripe before it waits for another
> stripe_head.
>
> Fixes: 3312e6c887fe ("md/raid5: Keep a reference to last stripe_head for batch")
> Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
>
> ---
>  drivers/md/raid5.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 6383723468e5..0644b83fd3f4 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -854,6 +854,13 @@ struct stripe_head *raid5_get_active_stripe(struct r5conf *conf,
>
>                 set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
>                 r5l_wake_reclaim(conf->log, 0);
> +
> +               /* release batch_last before wait to avoid risk of deadlock */
> +               if (ctx && ctx->batch_last) {
> +                       raid5_release_stripe(ctx->batch_last);
> +                       ctx->batch_last = NULL;
> +               }
> +
>                 wait_event_lock_irq(conf->wait_for_stripe,
>                                     is_inactive_blocked(conf, hash),
>                                     *(conf->hash_locks + hash));
> --
> 2.41.0
>
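
For anyone following along, the hold-and-wait pattern described in the
commit message can be sketched with a small single-threaded userspace
model. This is only an illustration under loud assumptions, not the
kernel code: struct stripe, struct pool, struct ctx, get_stripe() and
model_run() are all hypothetical names, the cache holds a single stripe,
and "waiting forever" is modeled by returning NULL because nothing else
in the model can ever free a stripe.

```c
#include <assert.h>
#include <stddef.h>

#define NR_STRIPES 1            /* hypothetical, deliberately tiny cache */

struct stripe { int count; };   /* count > 0 means the stripe is in use */
struct pool   { struct stripe s[NR_STRIPES]; };
struct ctx    { struct stripe *batch_last; };

/* Return an unused stripe, or NULL if every stripe is referenced. */
static struct stripe *find_free(struct pool *p)
{
	for (size_t i = 0; i < NR_STRIPES; i++)
		if (p->s[i].count == 0)
			return &p->s[i];
	return NULL;
}

/* Drop one reference to a stripe. */
static void release_stripe(struct stripe *sh)
{
	assert(sh->count > 0);
	sh->count--;
}

/*
 * Model of the allocation path. A NULL return stands in for the point
 * where the real function would sleep with no one left to wake it.
 */
static struct stripe *get_stripe(struct pool *p, struct ctx *ctx,
				 int release_before_wait)
{
	struct stripe *sh = find_free(p);

	if (!sh) {
		/* Same shape as the patch: drop batch_last before waiting. */
		if (release_before_wait && ctx && ctx->batch_last) {
			release_stripe(ctx->batch_last);
			ctx->batch_last = NULL;
		}
		/* "Wait": only our own release can have freed a stripe. */
		sh = find_free(p);
	}
	if (sh)
		sh->count++;
	return sh;
}

/* Returns 1 if the second request gets a stripe, 0 if it would hang. */
int model_run(int release_before_wait)
{
	struct pool p = { { { 0 } } };
	struct ctx ctx = { NULL };

	/* The first request succeeds and is kept as batch_last. */
	ctx.batch_last = get_stripe(&p, &ctx, release_before_wait);

	/* The second request finds the cache exhausted. */
	return get_stripe(&p, &ctx, release_before_wait) != NULL;
}
```

In this model, model_run(0) returns 0 (the caller still holds the only
stripe while waiting for one), and model_run(1) returns 1 (the reference
is dropped first, so the request can be satisfied). The real deadlock
involves other stripe_heads stuck behind batch_last rather than a
one-entry cache, so treat this purely as a picture of the ordering issue.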