Re: Update to mdadm V3.2.5 => RAID starts to recover (reproducible)

Andreas Baer <synthetic.gods@xxxxxxxxx> · Thu, 5 Sep 2013 17:22:26 +0200

On 9/2/13, NeilBrown <neilb@xxxxxxx> wrote:
> On Thu, 29 Aug 2013 11:55:09 +0200 Andreas Baer <synthetic.gods@xxxxxxxxx>
> wrote:
>
>> On 8/26/13, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Thu, 22 Aug 2013 15:20:06 +0200 Andreas Baer
>> > <synthetic.gods@xxxxxxxxx>
>> > wrote:
>> >
>> >> Short description:
>> >> I've discovered a problem during re-assembly of a clean RAID. mdadm
>> >> throws one disk out because this disk apparently shows another disk as
>> >> failed. After assembly, RAID starts to recover on existing spare disk.
>> >>
>> >> In detail:
>> >> 1. RAID-6 (Superblock V0.90.00) created with mdadm V2.6.4 and with 7
>> >> active disks and 1 spare disk (disk size: 1 TB), fully synced and
>> >> clean.
>> >> 2. RAID-6 stopped and re-assembled with mdadm V3.2.5, but during that
>> >> one disk is thrown out.
>> >>
>> >> Manual assembly command for /dev/md0, relevant partitions are
>> >> /dev/sd[b-i]1:
>> >> # mdadm --assemble --scan -vvv
>> >> mdadm: looking for devices for /dev/md0
>> >> mdadm: no RAID superblock on /dev/sdi
>> >> mdadm: no RAID superblock on /dev/sdh
>> >> mdadm: no RAID superblock on /dev/sdg
>> >> mdadm: no RAID superblock on /dev/sdf
>> >> mdadm: no RAID superblock on /dev/sde
>> >> mdadm: no RAID superblock on /dev/sdd
>> >> mdadm: no RAID superblock on /dev/sdc
>> >> mdadm: no RAID superblock on /dev/sdb
>> >> mdadm: no RAID superblock on /dev/sda1
>> >> mdadm: no RAID superblock on /dev/sda
>> >> mdadm: /dev/sdi1 is identified as a member of /dev/md0, slot 7.
>> >> mdadm: /dev/sdh1 is identified as a member of /dev/md0, slot 6.
>> >> mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 5.
>> >> mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 4.
>> >> mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 3.
>> >> mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 2.
>> >> mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 1.
>> >> mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 0.
>> >> mdadm: ignoring /dev/sdb1 as it reports /dev/sdi1 as failed
>> >> mdadm: no uptodate device for slot 0 of /dev/md0
>> >> mdadm: added /dev/sdd1 to /dev/md0 as 2
>> >> mdadm: added /dev/sde1 to /dev/md0 as 3
>> >> mdadm: added /dev/sdf1 to /dev/md0 as 4
>> >> mdadm: added /dev/sdg1 to /dev/md0 as 5
>> >> mdadm: added /dev/sdh1 to /dev/md0 as 6
>> >> mdadm: added /dev/sdi1 to /dev/md0 as 7
>> >> mdadm: added /dev/sdc1 to /dev/md0 as 1
>> >> mdadm: /dev/md0 has been started with 6 drives (out of 7) and 1 spare.
>> >>
>> >> I finally made a test by modifying mdadm V3.2.5 sources to not write
>> >> any data to any superblock and to simply exit() somewhere in the
>> >> middle of assembly process to be able to reproduce this behavior
>> >> without any RAID re-creation/synchronization.
>> >> So using mdadm V2.6.4 /dev/md0 assembles without problems and if I
>> >> switch to mdadm V3.2.5 it shows the same messages as above.
>> >>
>> >> The real problem:
>> >> I have more than a single machine receiving a similar software update
>> >> so I need to find a solution or workaround for this problem. By the
>> >> way, in another test without an existing spare disk, there seems to
>> >> be no 'throwing out' problem when switching from V2.6.4 to V3.2.5.
>> >>
>> >> It would also be a great help if someone could explain the reason
>> >> behind the relevant code fragment for rejecting a device, e.g. why is
>> >> only the 'most_recent' device important?
>> >>
>> >> /* If this device thinks that 'most_recent' has failed, then
>> >>   * we must reject this device.
>> >>   */
>> >> if (j != most_recent &&
>> >>     content->array.raid_disks > 0 &&
>> >>     devices[most_recent].i.disk.raid_disk >= 0 &&
>> >>     devmap[j * content->array.raid_disks +
>> >>            devices[most_recent].i.disk.raid_disk] == 0) {
>> >>     if (verbose > -1)
>> >>         fprintf(stderr,
>> >>             Name ": ignoring %s as it reports %s as failed\n",
>> >>             devices[j].devname, devices[most_recent].devname);
>> >>     best[i] = -1;
>> >>     continue;
>> >> }
>> >>
>> >> I also attached some files showing some details about related
>> >> superblocks before and after assembly as well as about RAID status
>> >> itself.
>> >
>> >
>> > Thanks for the thorough report.  I think this issue has been fixed in
>> > 3.3-rc1.
>> > You can fix it for 3.2.5 by applying the following patch:
>> >
>> > diff --git a/Assemble.c b/Assemble.c
>> > index 227d66f..bc65c29 100644
>> > --- a/Assemble.c
>> > +++ b/Assemble.c
>> > @@ -849,7 +849,8 @@ int Assemble(struct supertype *st, char *mddev,
>> >  		devices[devcnt].i.disk.minor = minor(stb.st_rdev);
>> >  		if (most_recent < devcnt) {
>> >  			if (devices[devcnt].i.events
>> > -			    > devices[most_recent].i.events)
>> > +			    > devices[most_recent].i.events &&
>> > +			    devices[devcnt].i.disk.state == 6)
>> >  				most_recent = devcnt;
>> >  		}
>> >  		if (content->array.level == LEVEL_MULTIPATH)
>> >
>> > The "most recent" device is important as we need to choose one to
>> > compare all others against.  The problem is that the code in 3.2.5
>> > can sometimes choose a spare, which isn't such a good idea.
>> >
>> > The "most recent" is also important because when a collection of devices
>> > is given to the kernel it will give priority to some information which is
>> > on the last device passed in.  So we make sure that the last device given
>> > to the kernel is the "most recent".
>> >
>> > Please let me know if the patch fixes your problem.
>> >
>> > NeilBrown
>>
>> First of all, thanks for your very helpful 'most recent disk'
>> explanation.
>>
>> Sadly, the patch didn't fix my problem: the event counters are
>> identical on all disks (including the spare), and the first disk
>> checked is the spare, so no other disk is ever selected as the
>> 'most recent disk'. I extended your patch a little to print more
>> output and also wrote my own solution, but that one needs review
>> because I'm not sure it can be done like that.
>>
>> Patch 1: Your solution with more output
>> Diff: mdadm-3.2.5-noassemble-patch1.diff
>> Assembly: mdadm-3.2.5-noassemble-patch1.txt
>>
>> Patch 2: My proposed solution
>> Diff: mdadm-3.2.5-noassemble-patch2.diff
>> Assembly: mdadm-3.2.5-noassemble-patch2.txt
>
>
> Thanks for the testing and suggestions.  I see what I missed now.
> Can you check if this patch works please?
>
> Thanks.
> NeilBrown
>
> diff --git a/Assemble.c b/Assemble.c
> index 227d66f..9131917 100644
> --- a/Assemble.c
> +++ b/Assemble.c
> @@ -215,7 +215,7 @@ int Assemble(struct supertype *st, char *mddev,
>  	unsigned int okcnt, sparecnt, rebuilding_cnt;
>  	unsigned int req_cnt;
>  	int i;
> -	int most_recent = 0;
> +	int most_recent = -1;
>  	int chosen_drive;
>  	int change = 0;
>  	int inargv = 0;
> @@ -847,8 +847,9 @@ int Assemble(struct supertype *st, char *mddev,
>  		devices[devcnt].i = *content;
>  		devices[devcnt].i.disk.major = major(stb.st_rdev);
>  		devices[devcnt].i.disk.minor = minor(stb.st_rdev);
> -		if (most_recent < devcnt) {
> -			if (devices[devcnt].i.events
> +		if (devices[devcnt].i.disk_state == 6) {
> +			if (most_recent < 0 ||
> +			    devices[devcnt].i.events
>  			    > devices[most_recent].i.events)
>  				most_recent = devcnt;
>  		}

Your patch seems to work without issues.

There is only a small typo:
+		if (devices[devcnt].i.disk_state == 6) {
should be:
+		if (devices[devcnt].i.disk.state == 6) {
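
For anyone following along, here is a minimal standalone sketch of why
the spare ended up as 'most recent' and how the corrected patch avoids
it. It is not the mdadm code: struct dev_info, STATE_ACTIVE_SYNC and
both function names are hypothetical simplifications, and I assume a
spare carries state 0 while 6 means active and in sync.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for mdadm's per-device info. */
struct dev_info {
    unsigned long long events; /* superblock event counter */
    int state;                 /* disk state bits */
};

/* 6 = active + in sync; a spare is assumed to have state 0 here. */
#define STATE_ACTIVE_SYNC 6

/* Old behaviour: start at index 0 and move only on a strictly higher
 * event count. With equal counts, device 0 wins even if it is a spare. */
int pick_most_recent_old(const struct dev_info *devs, size_t n)
{
    int most_recent = 0;

    assert(devs != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if ((size_t)most_recent < i &&
            devs[i].events > devs[most_recent].events)
            most_recent = (int)i;
    }
    return most_recent;
}

/* Patched behaviour: start at -1 and consider only active, in-sync
 * devices. Returns -1 if no such device exists. */
int pick_most_recent_fixed(const struct dev_info *devs, size_t n)
{
    int most_recent = -1;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].state != STATE_ACTIVE_SYNC)
            continue; /* spares are never candidates */
        if (most_recent < 0 ||
            devs[i].events > devs[most_recent].events)
            most_recent = (int)i;
    }
    return most_recent;
}
```

With the spare listed first and equal event counts everywhere (my
case), the old function keeps index 0, i.e. the spare, while the fixed
one picks the first active, in-sync device.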

I've attached the patch that I ended up using to this mail.
Thank you very much for your help.
Attachment: no-spare-as-most_recent.patch