On Tue, 06 Aug 2013 23:26:30 +0200 Martin Wilck <mwilck@xxxxxxxx> wrote:

> On 08/06/2013 02:16 AM, NeilBrown wrote:
> > On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@xxxxxxxx> wrote:
> >
> >> Hi Albert, Neil,
> >>
> >> I just submitted a new patch series; patch 3/5 integrates your 2nd case
> >> as a new unit test and 4/5 should fix it.
> >>
> >> However @Neil: I am not yet entirely happy with this solution. AFAICS
> >> there is a possible race condition here, if a disk fails and mdadm -CR
> >> is called to create a new array before the metadata reflecting the
> >> failure is written to disk. If a disk failure happens in one array,
> >> mdmon will call reconcile_failed() to propagate the failure to other
> >> already known arrays in the same container, by writing "faulty" to the
> >> sysfs state attribute. It can't do that for a new container though.
> >>
> >> I thought that process_update() may need to check the kernel state of
> >> array members against the metadata state when a new VD configuration
> >> record is received, but that's impossible because we can't call open()
> >> on the respective sysfs files. It could be done in prepare_update(),
> >> but that would require major changes, so I wanted to ask you first.
> >>
> >> Another option would be changing manage_new(). But we don't seem to have
> >> a suitable metadata handler method to pass the metadata state to the
> >> manager...
> >>
> >> Ideas?
> >
> > Thanks for the patches - I applied them all.
>
> I don't see them in the public repo yet.
>
> > Is there a race here?  When "mdadm -C" looks at the metadata, the device
> > will either be an active member of another array, or it will be marked
> > faulty.  Either way mdadm won't use it.
>
> That's right, thanks.
>
> > If the first array was created to use only (say) half of each device, and
> > the second array was created with a size to fit in the other half of the
> > device, then it might get interesting.
> > "mdadm -C" might see that everything looks good, create the array using
> > the second half of that drive that has just failed, and give that info
> > to mdmon.
>
> Yes, I have created a test case for this (10ddf-fail-create-race) which
> I am going to submit soon.
>
> > I suspect that ddf_open_new (which currently looks like it is just a stub)
> > needs to help out here.
>
> Great idea, I made an implementation. I found that I needed to freeze
> the array in Create(), too, to avoid the kernel starting a rebuild
> before mdmon had checked the correctness of the new array. Please review
> that; I'm not 100% positive I got it right.
>
> > When manage_new() gets told about a new array, it will collect relevant
> > info from sysfs and call ->open_new() to make sure it matches the
> > metadata.  ddf_open_new should check that all the devices in the array
> > are recorded as working in the metadata.  If any are failed, it can
> > write 'faulty' to the relevant state_fd.
> >
> > Possibly the same thing can be done generically in manage_new() as you
> > suggested.  After the new array has been passed over to the monitor
> > thread, manage_new() could check whether any devices should be failed,
> > much like reconcile_failed() does, and just fail them.
> >
> > Does that make any sense?  Did I miss something?
>
> It makes a lot of sense.
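
For illustration, the check could look roughly like the sketch below. The
struct and helper names (new_member, fail_stale_members, failed_in_metadata)
are invented for this sketch rather than taken from mdadm; the real
manage_new()/ddf_open_new() would presumably work with the mdinfo that the
manager already collects from sysfs.

    /* Illustration only, not actual mdadm code; the names are made up. */
    #include <stdio.h>
    #include <unistd.h>

    struct new_member {
            int state_fd;            /* open fd for the member's sysfs "state" file */
            int failed_in_metadata;  /* 1 if the metadata already records this disk as failed */
    };

    /*
     * Fail every member of a newly reported array that the (possibly newer)
     * metadata already knows is bad, the same thing reconcile_failed()
     * does for arrays that mdmon already manages.
     */
    static void fail_stale_members(struct new_member *members, int count)
    {
            int i;

            for (i = 0; i < count; i++) {
                    if (!members[i].failed_in_metadata)
                            continue;
                    /* writing "faulty" asks the kernel to kick the device out */
                    if (write(members[i].state_fd, "faulty", 6) < 0)
                            perror("write faulty");
            }
    }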
>
> While testing, I found another minor problem case:
>
> 1. a disk fails in an array that uses only half of each device
> 2. mdmon activates a spare
> 3. mdadm -C is called, finds the old metadata, and allocates an extent at
>    offset 0 on the spare
> 4. Create() gets an error writing to the "size" sysfs attribute, because
>    offset 0 has already been grabbed by the spare recovery
>
> That's not too bad, after all, because the array won't be created. The
> user just needs to re-issue his mdadm -C command, which will now succeed
> because the metadata should have been written to disk in the meantime.
>
> That said, some kind of locking between mdadm and mdmon (mdadm won't
> read metadata as long as mdmon is busy writing it) might be desirable.
> It would be even better to do all metadata operations through mdmon,
> with mdadm just sending messages to it. That would be a major
> architectural change for mdadm, but it would avoid this kind of
> "different metadata here and there" problem altogether.

I think the way this locking "should" happen is with an O_EXCL lock on
the container.  mdadm and mdmon already do this to some extent, but it is
hard to find :-)

The most obvious usage is that mdmon insists on getting an O_EXCL open
before it will exit.  --create also gets an O_EXCL open, so mdmon cannot
exit while a new array might be being created.
See "When creating a member, we need to be careful" in Create.c
and "No interesting arrays, or we have been told to" in monitor.c.

So this race could be closed by the manager getting an O_EXCL open on the
container before performing spare management.

Thanks,
NeilBrown
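
PS: purely as a sketch of that exclusive-open idea (the function and
parameter names below are invented, not taken from mdadm), the manager side
could look something like this, skipping spare management for one pass
whenever someone else, e.g. "mdadm -C", currently holds the container:

    /*
     * Illustration only, not the actual mdadm/mdmon code; the function and
     * parameter names are made up.  The point is the handshake: an exclusive
     * open of the container device fails while another process holds it
     * O_EXCL, so spare management waits for the next loop instead of racing
     * with array creation.
     */
    #include <fcntl.h>
    #include <unistd.h>

    static int spare_management_locked(const char *container_dev,
                                       void (*do_spare_management)(void))
    {
            int fd = open(container_dev, O_RDONLY | O_EXCL);

            if (fd < 0)
                    return -1;              /* container busy; retry later */

            do_spare_management();          /* activate spares etc. */

            close(fd);
            return 0;
    }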