RE: [PATCH 2/2] imsm: FIX: Be more patient during loading matadata

"Kwolek, Adam" <adam.kwolek@xxxxxxxxx> · Wed, 13 Apr 2011 07:40:35 +0100

> -----Original Message-----
> From: dan.j.williams@xxxxxxxxx [mailto:dan.j.williams@xxxxxxxxx] On
> Behalf Of Dan Williams
> Sent: Wednesday, April 13, 2011 2:45 AM
> To: Kwolek, Adam
> Cc: neilb@xxxxxxx; linux-raid@xxxxxxxxxxxxxxx; Ciechanowski, Ed;
> Neubauer, Wojciech
> Subject: Re: [PATCH 2/2] imsm: FIX: Be more patient during loading
> matadata
> 
> On Tue, Apr 12, 2011 at 5:51 AM, Adam Kwolek <adam.kwolek@xxxxxxxxx>
> wrote:
> > Sometimes occurs that metadata cannot be loaded e.g. wrong check sum
> > It can happen due to metadata update racing with mdmon condition.
> > If mpb loading is tried again, it is loaded successfully.
> > Try to load metadata again before really giving up.
> >
> > Signed-off-by: Adam Kwolek <adam.kwolek@xxxxxxxxx>
> > ---
> >
> >  super-intel.c |   10 ++++++++--
> >  1 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/super-intel.c b/super-intel.c
> > index dc5e34e..d23267a 100644
> > --- a/super-intel.c
> > +++ b/super-intel.c
> > @@ -2773,8 +2773,14 @@ load_and_parse_mpb(int fd, struct intel_super
> *super, char *devname, int keep_fd
> >        int err;
> >
> >        err = load_imsm_mpb(fd, super, devname);
> > -       if (err)
> > -               return err;
> > +       if (err) {
> > +               /* try to load mpb again,
> > +                * in case of mdmon race we could have more luck...
> > +                */
> > +               err = load_imsm_mpb(fd, super, devname);
> > +               if (err)
> > +                       return err;
> > +       }
> >        err = load_imsm_disk(fd, super, devname, keep_fd);
> >        if (err)
> >                return err;
> 
> This is semi-duplicates the check we already do after returning from
> load_and_parse_mpb in load_super_imsm_all.  I'm curious, are you
> hitting this path from load_super_imsm?  If the container is assembled
> we should be loading from the container, if the container is not
> available then mdmon can't be running and checksum errors are real.
> 
> --
> Dan

My test scenario is that after boot I'm disassembling read only array and immediately new array is assembled for grow continuation.
Sometimes occurs that mdadm throws exception and core file is generated. It shows that anchor pointer has NULL value due to CRC error.
Second reading try helps, and anchor is always read correctly.
This behavior and fact that if I put more time between array disassembling and assembling it again helps also suggest, that we have some race condition here.
Problem is not in currently monitored in mdmon container but rather in interaction with previous mdmon session that is about to close.

This patch makes that error condition never occurs in this scenario. Grow.c is fixed for correct error condition behavior also.
I can agree that both patches in this series can help for this problem separately, but I think both should be placed in code.

BR
Adam

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html