Re: [PATCH] Fix: Sometimes mdmon throws core dump during reshape

"Williams, Dan J" <dan.j.williams@xxxxxxxxx> · Tue, 6 Sep 2011 12:09:54 -0700

On Mon, Sep 5, 2011 at 3:39 AM, Adam Kwolek <adam.kwolek@xxxxxxxxx> wrote:
> The problem was found while reshaping 2 volumes (raid0 and raid5) in a container.
> Sometimes mdmon dumps core due to a NULL pointer dereference.
>
> The problem occurs in this scenario:
> - managemon: is about to activate a spare (degraded raid4 volume == raid0 under takeover)
> - managemon: detects the level change and signals the monitor (manage_member() calls replace_array())
> - monitor: detects the raid4/5->raid0 transition and sets a->container to NULL
>           to indicate array deactivation

Maybe I have lost track of the reshape implementation, but I don't see
where the monitor sets ->container to NULL during a reshape.  Do you
mean that mdmon is deactivated for the array after the reshape completes?

> - managemon: continues its work and tries to activate the spare (a->check_degraded is set).
>              A NULL pointer is passed to the metadata handler activate_spare()
>              and a core dump is generated.
>
> To resolve this, managemon rechecks the a->container pointer after kicking
> the monitor to learn whether the current array is about to be deactivated.
[..]
> diff --git a/managemon.c b/managemon.c
> index d020f82..3540dac 100644
> --- a/managemon.c
> +++ b/managemon.c
> @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
>                }
>        }
>
> +       /* we are after monitor kick,
> +        * so container field can be cleared - check it again
> +        */
> +       if (a->container == NULL)
> +               return;
> +

Isn't this still racy?  We don't wait for the monitor to run before
proceeding, so ->container could still be cleared after the new check.
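
To make the window concrete, here is a minimal single-threaded sketch,
not the mdmon code.  It assumes the monitor really can clear
->container at an arbitrary point, which is the part I am unsure about.
struct array_sketch, monitor_deactivate() and manage_member_sketch()
are hypothetical names used only for illustration; the monitor is
replayed through a callback so the interleaving is deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real mdmon structures. */
struct container_sketch { int id; };

struct array_sketch {
	struct container_sketch *container; /* cleared to signal deactivation */
	int check_degraded;                 /* a spare activation is pending */
};

/* Callback type used to replay the monitor at a chosen point. */
typedef void (*monitor_step)(struct array_sketch *a);

/* Assumed monitor behaviour: clear the pointer to deactivate the array. */
static void monitor_deactivate(struct array_sketch *a)
{
	a->container = NULL;
}

enum outcome { ACTIVATED = 0, SKIPPED = -1, WOULD_CRASH = -2 };

static enum outcome manage_member_sketch(struct array_sketch *a,
					 monitor_step in_window)
{
	/* The recheck added by the patch. */
	if (a->container == NULL)
		return SKIPPED;

	/* Nothing makes managemon wait here, so the monitor may run now. */
	if (in_window)
		in_window(a);

	if (a->check_degraded) {
		/* The real code would hand a->container to activate_spare(). */
		if (a->container == NULL)
			return WOULD_CRASH;
	}
	return ACTIVATED;
}

/* Replays three interleavings against one array. */
static void replay_interleavings(void)
{
	struct container_sketch c = { 1 };
	struct array_sketch a = { &c, 1 };

	/* Monitor stays quiet: the spare path sees a valid pointer. */
	assert(manage_member_sketch(&a, NULL) == ACTIVATED);
	/* Monitor runs between check and use: the check did not help. */
	assert(manage_member_sketch(&a, monitor_deactivate) == WOULD_CRASH);
	/* Monitor ran before the check: only this case is caught. */
	assert(manage_member_sketch(&a, NULL) == SKIPPED);
}
```

Calling replay_interleavings() passes all three assertions.  The second
case is the one I am worried about: the recheck narrows the window but
does not close it unless something orders the check against the monitor.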