Re: "bitmap file is out of date, doing full recovery"

Alexander Lyakas <alex.bolshoy@xxxxxxxxx> · Sun, 14 Dec 2014 14:11:05 +0200

Hi Neil,
This issue keeps happening to us. Do you see any problem in always
incrementing the event count?

Thanks,
Alex.

On Tue, Nov 4, 2014 at 11:17 AM, Alexander Lyakas
<alex.bolshoy@xxxxxxxxx> wrote:
> Hi Neil,
> thank you for your comments.
>
> On Wed, Oct 29, 2014 at 1:19 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Thu, 23 Oct 2014 19:04:48 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> wrote:
>>
>>> Hi Neil,
>>> I found at least one way of this happening. The problem is that in
>>> md_update_sb() we allow to decrease the event count:
>>>
>>>     /* If this is just a dirty<->clean transition, and the array is clean
>>>      * and 'events' is odd, we can roll back to the previous clean state */
>>>     if (nospares
>>>         && (mddev->in_sync && mddev->recovery_cp == MaxSector)
>>>         && mddev->can_decrease_events
>>>         && mddev->events != 1) {
>>>         mddev->events--;
>>>         mddev->can_decrease_events = 0;
>>>
>>> Then we call bitmap_update_sb(). If we crash after we update (the
>>> first or all of) bitmap superblocks, then after reboot, we will see
>>> that bitmap event count is less than MD superblock event count. Then
>>> we decide to do full resync.
>>>
>>> This can be easily reproduced by hacking bitmap_update_sb() to call
>>> BUG(), after it calls write_page() in case event count was decreased.
>>>
>>> Why we are decreasing the event count??? Can we always increase it?
>>> u64 is a lot to increase...
>>
>> The reason for decreasing the event count is so that we don't need to update
>> the event count on spares - they can be left spun down.
>> We for simple clean/dirty transitions with increment for clean->dirty and
>> decrement for dirty->clean.  But we should only use this optimisation when
>> everything is simple.
>> We really shouldn't do this when the array is degraded.
>> Do this fix your problem?
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 2c73fcb82593..98fd97b10e13 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -2244,6 +2244,7 @@ repeat:
>>          * and 'events' is odd, we can roll back to the previous clean state */
>>         if (nospares
>>             && (mddev->in_sync && mddev->recovery_cp == MaxSector)
>> +           && mddev->degraded == 0
>>             && mddev->can_decrease_events
>>             && mddev->events != 1) {
>>                 mddev->events--;
>>
>>
> No, unfortunately, this doesn't fix the problem. In my case, the array
> is never degraded. Both drives are present and operational, then the
> box crashes, and after reboot the bitmap event counter is lower than
> we expect. Again, this is easily reproduced by hacking
> bitmap_update_sb() as I mentioned earlier.
>
> In my case array does not have spares. (There is some other system on
> top, which monitors the array, and, if needed, adds a spare from a
> "global" spare pool). Is this ok in this case to always increment the
> event count?
>
> Thanks,
> Alex.
>
>
>>>
>>> Some other doubt that I have is that bitmap_unplug() and
>>> bitmap_daemon_work() call write_page() on page index=0. This page
>>> contains both the superblock and also some dirty bits (could not we
>>> waste 4KB on bitmap superblock???). I am not sure, but I wonder
>>> whether this call can race with md_update_sb (which explicitly calls
>>> bitmap_update_sb), and somehow write the outdated superblock, after
>>> bitmap_update_sb has completed writing it.
>>>
>>
>> storage.sb_page is exactly the same as storage.filemap[0]
>> So once an update has happened, the "outdated superblock" doesn't exist
>> anywhere to be written out from.
>>
>>> Yet another suspect is when loading the bitmap we basically load it
>>> from the first up-to-date drive. Maybe we should have scanned all the
>>> bitmap superblocks, and selected one that has the higher event count
>>> (although as we saw "higher" does not necessarily mean "more
>>> up-to-date").
>>>
>>> Anyways, back to decrementing the event count. Do you see any issue
>>> with not doing this and always incrementing?
>>>
>>> Thanks,
>>> Alex.
>>>
>>
>> Thanks,
>> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html