RE: [PATCH] Fix: Sometimes mdmon throws core dump during reshape

"Kwolek, Adam" <adam.kwolek@xxxxxxxxx> · Wed, 7 Sep 2011 06:25:33 +0000

> -----Original Message-----
> From: Williams, Dan J [mailto:dan.j.williams@xxxxxxxxx]
> Sent: Tuesday, September 06, 2011 9:10 PM
> To: Kwolek, Adam
> Cc: neilb@xxxxxxx; linux-raid@xxxxxxxxxxxxxxx; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: [PATCH] Fix: Sometimes mdmon throws core dump during
> reshape
> 
> On Mon, Sep 5, 2011 at 3:39 AM, Adam Kwolek <adam.kwolek@xxxxxxxxx>
> wrote:
> > Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
> > Sometimes mdmon throws core dump due to NULL pointer exception.
> >
> > Problem occurs in scenario:
> > - managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
> > - managemon: detect level change and signals monitor (manage_member() calls replace_array())
> > - monitor: detects transition raid4/5->raid0 and sets a->container to NULL
> >           to indicate array deactivation
> 
> Maybe I have lost track of the reshape implementation but I don't see
> where the monitor sets ->container to NULL during a reshape?  Do you
> mean deactivate mdmon for the array after the reshape completes?
> 
> > - managemon : continues his work and tries to activate spare (a->check_degraded is set).
> >              NULL pointer is passed to metadata handler activate_spare()
> >              Core dump is generated.
> >
> > To resolve this situation managemon (after monitor kick) checks again
> > a->container pointer to learn if current array is not to be deactivated.

Yes, when takeover is used. On the one hand, mdmon tries to resolve the "degradation" of the raid0 volume
under takeover (it looks like a degraded raid4); meanwhile, the backward takeover to raid0 occurs.
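
The sequence can be sketched roughly as below. This is only a simplified model, not the mdmon code: the struct layouts and the names monitor_step(), activate_spare_model() and manage_member_model() are hypothetical, and the monitor is run synchronously here just to show the ordering of events.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real mdmon structures. */
struct container { int id; };

struct array {
    struct container *container; /* NULL means: array is to be deactivated */
    int check_degraded;          /* set when a spare should be activated */
    int level;                   /* current RAID level */
};

/* Model of the monitor seeing raid4/5 -> raid0 and clearing ->container. */
static void monitor_step(struct array *a, int new_level)
{
    if ((a->level == 4 || a->level == 5) && new_level == 0)
        a->container = NULL;
    a->level = new_level;
}

/* Model of the metadata handler. The real handler would dereference the
 * container; here a NULL container is reported as -1 instead of crashing. */
static int activate_spare_model(struct array *a)
{
    if (a->container == NULL)
        return -1;
    return 1;
}

/* Model of the manager path. With recheck != 0 the container pointer is
 * tested again after the monitor kick, as the patch proposes.
 * Returns 1 if a spare was activated, 0 if nothing was done,
 * -1 if a NULL container reached the handler. */
static int manage_member_model(struct array *a, int new_level, int recheck)
{
    if (a->level != new_level)
        monitor_step(a, new_level); /* "monitor kick", synchronous in this model */

    if (recheck && a->container == NULL)
        return 0;

    if (a->check_degraded)
        return activate_spare_model(a);
    return 0;
}
```

In this model, a raid4 array with check_degraded set that goes back to raid0 reaches the handler with a NULL container unless the re-check is enabled.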

BR
Adam

> [..]
> > diff --git a/managemon.c b/managemon.c
> > index d020f82..3540dac 100644
> > --- a/managemon.c
> > +++ b/managemon.c
> > @@ -475,6 +475,12 @@ static void manage_member(struct mdstat_ent *mdstat,
> >                }
> >        }
> >
> > +       /* we are after monitor kick,
> > +        * so container field can be cleared - check it again
> > +        */
> > +       if (a->container == NULL)
> > +               return;
> > +
> 
> Isn't this still racy?  Because we don't wait for the monitor to run
> before proceeding.
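
For reference, the window asked about here can be modelled as below. This is a single-threaded sketch under assumptions: kick_monitor(), monitor_run() and manager_path() are hypothetical names, and the flag monitor_runs_early only simulates whether the monitor thread happens to be scheduled before or after the manager's check.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real mdmon structures. */
struct container { int id; };

struct array {
    struct container *container; /* NULL means: array is to be deactivated */
    int pending_deactivate;      /* set by the kick, consumed by the monitor */
};

enum outcome { SKIPPED, ACTIVATED, NULL_DEREF };

/* Manager side: only requests work from the monitor, does not wait for it. */
static void kick_monitor(struct array *a)
{
    a->pending_deactivate = 1;
}

/* Monitor side: runs whenever its thread gets scheduled. */
static void monitor_run(struct array *a)
{
    if (a->pending_deactivate) {
        a->container = NULL;
        a->pending_deactivate = 0;
    }
}

/* monitor_runs_early != 0: monitor is scheduled before the manager's check.
 * monitor_runs_early == 0: monitor is scheduled after the check, but before
 * the container pointer is used. */
static enum outcome manager_path(struct array *a, int monitor_runs_early)
{
    kick_monitor(a);
    if (monitor_runs_early)
        monitor_run(a);

    if (a->container == NULL) /* the re-check added by the patch */
        return SKIPPED;

    if (!monitor_runs_early)
        monitor_run(a);       /* monitor runs inside the remaining window */

    return a->container ? ACTIVATED : NULL_DEREF;
}
```

In this model the re-check helps only when the monitor has already run by the time the pointer is tested; if it runs later, the NULL pointer is still reached.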