Hi Neil,

Based on your hints I dug some more into resync failure cases and the
handling of sb->resync_offset and mddev->recovery_cp. Here are some
observations. The only cases in which recovery_cp is set (apart from
setting it via sysfs, setting bitmap bits via sysfs, etc.) are:

# On array creation, recovery_cp is set to 0 (and the bitmap is fully
  set).
# recovery_cp becomes MaxSector only if MD_RECOVERY_SYNC (with or
  without REQUESTED) completes successfully. On raid6 this may happen
  when a singly-degraded array completes a resync.
# If the resync does not complete successfully (MD_RECOVERY_INTR or a
  crash), then recovery_cp remains valid (not MaxSector).
# The only real influence that recovery_cp seems to have is:
  1) abort the assembly if ok_start_degraded is not set;
  2) when loading the superblock, it looks like recovery_cp may cause
     the beginning of the bitmap not to be loaded (I did not dig
     further into the bitmap at this point);
  3) resume the resync in md_check_recovery() if recovery_cp is valid.

Are these observations valid?

With this scheme I saw several interesting issues:

# After a resync is aborted/interrupted, recovery_cp is updated (either
  to MaxSector or to some other value). However, the superblock is not
  updated at this point, and if there is no additional activity on the
  array it never will be. I saw cases where the resync completed fine
  and recovery_cp was set to MaxSector, but it was not persisted in the
  superblock. If I crash the machine at this point, then after reboot
  the array is still considered dirty (it has a valid resync_offset in
  the superblock). Is there an option to force the superblock update at
  this point?
# When a resync aborts due to a drive failure, MD_RECOVERY_INTR is set
  and sync_request() returns mddev->dev_sectors - sector_nr. As a
  result, recovery_cp is set to the device size, and that is what I see
  in my raid6 scenario. At this point three things can happen:
  1) If there is an additional drive to resync (as in raid6), the
     resync restarts (from sector 0).
  2) If there is a spare, it starts rebuilding the spare and, as a
     result, persists sb->resync_offset==sb->size in the superblock.
  3) Otherwise, it restarts the resync (due to the valid recovery_cp),
     immediately finishes it, and sets recovery_cp=MaxSector (but does
     not necessarily write the superblock right away).

So basically we can have a rebuilding drive and
sb->resync_offset==sb->size in the superblock, and it will be cleared
only after the rebuild completes and this "empty-resync" happens. Are
you (we?) comfortable with this behavior? (Due to the mdadm issue I
mentioned earlier, mdadm in this case thinks the array is clean, while
the kernel thinks the array is dirty, while it is actually rebuilding.)

Now to your suggestions:

>> So one question is: should mdadm compare sb->resync_offset to
>> MaxSector and not to sb->size? In the kernel code, resync_offset is
>> always compared to MaxSector.
>
> Yes, mdadm should be consistent with the kernel. Patches welcome.

I think at this point it's better to consider the array as dirty in
mdadm, and either let the kernel set MaxSector in this "empty-resync"
or set it via mdadm with --force. Let's see if I can submit this
trivial patch (rough sketch in the P.S. below).

>> Another question is: whether sb->resync_offset should be set to
>> MaxSector by the kernel as soon as it starts rebuilding a drive? I
>> think this would be consistent with what Neil wrote in the blog entry.
>
> Maybe every time we update ->curr_resync_completed we should update
> ->recovery_cp as well if it is below the new ->curr_resync_completed ??

I'm not sure it will help a lot.
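If I read the suggestion right, it would amount to something like the
sketch below, applied wherever ->curr_resync_completed is advanced
(untested, and I am only guessing at the exact spot, e.g. the places in
md_do_sync() that bump it):

	/* sketch: drag the resync checkpoint forward together with the
	 * resync progress; a recovery_cp of MaxSector is naturally left
	 * untouched, since nothing can be larger than it */
	if (mddev->recovery_cp < mddev->curr_resync_completed)
		mddev->recovery_cp = mddev->curr_resync_completed;

Please correct me if that is not what you meant.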
In any case, except for not loading part of the bitmap (which I am
unsure about), the real value of recovery_cp does not really matter.
Is this correct?

>> Here is the scenario to reproduce the issue I described:
>> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
>> # Fail drive D. Array aborts the resync and then immediately restarts
>> it (it seems to checkpoint the mddev->recovery_cp, but I am not sure
>> that it restarts from that checkpoint)
>> # Re-add drive D to the array. It is added as a spare, array continues
>> resyncing
>> # Fail drive C. Array aborts the resync, and then starts rebuilding
>> drive D. At this point sb->resync_offset is some valid value (usually
>> 0, not MaxSector and not sb->size).
>
> Does it start the rebuilding from the start? I hope it does.

Yes, it starts from j==0 (and then it proceeds according to the bitmap,
I guess and hope).

>> # Stop the array. At this point sb->resync_offset is sb->size in all
>> the superblocks.
>
> At some point in there you had a RAID6 with two missing devices, so it is
> either failed or completely in-sync. I guess we assume the latter.
> Is that wrong?

It's not that it's wrong, but if we assume the array is clean, then
resync_offset should be MaxSector in the superblock. But it is not, as
I showed earlier.

About exposing ok_start_degraded via sysfs: is it possible to push this
value through the RUN_ARRAY ioctl? I saw that it can carry an
mdu_param_t, which has a max_fault /* unused for now */ field. Perhaps
this field could be used?

I have another question, about this code in md_check_recovery():

	if (mddev->safemode &&
	    !atomic_read(&mddev->writes_pending) &&
	    !mddev->in_sync &&
	    mddev->recovery_cp == MaxSector) {
		mddev->in_sync = 1;
		...

Why is mddev->recovery_cp == MaxSector checked here? This basically
means that if we have a valid recovery_cp, then in_sync will never be
set to 1, and this code:

	if (mddev->in_sync)
		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
	else
		sb->resync_offset = cpu_to_le64(0);

should always set resync_offset to 0. I saw that "in_sync" basically
tells whether there are pending/in-flight writes.

Finally, what would you recommend doing if a raid5 resync fails? At
this point the kernel does not fail the array (as you pointed out in
your blog), but if the machine reboots, the array cannot be
re-assembled without "force", because it is dirty and has a failed
drive. So it is kind of... inconsistent?

Thanks for your help,
Alex.
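P.S. Regarding the trivial mdadm patch above: what I have in mind is to
make super1.c consider the array clean only when resync_offset equals
MaxSector, i.e. the same test the kernel uses, instead of comparing it
against sb->size. Roughly the sketch below (the helper name is mine, I
still need to find the right place in super1.c, and I am assuming
mdadm's MaxSector define):

	/* sketch: mirror the kernel's notion of "clean" for a v1.x
	 * superblock: clean only if resync_offset is MaxSector */
	static int sb1_resync_complete(struct mdp_superblock_1 *sb)
	{
		return __le64_to_cpu(sb->resync_offset) == MaxSector;
	}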