Re: weird issues with raid1

On Mon, Dec 15, 2008 at 12:00 AM, Neil Brown <neilb@xxxxxxx> wrote:
> On Friday December 5, jnelson-linux-raid@xxxxxxxxxxx wrote:
>> I set up a raid1 between some devices, and have been futzing with it.
>> I've been encountering all kinds of weird problems, including one
>> which required me to reboot my machine.
>>
>> This is long, sorry.
>>
>> First, this is how I built the raid:
>>
>> mdadm --create /dev/md10 --level=1 --raid-devices=2 --bitmap=internal
>> /dev/sdd1 --write-mostly --write-behind missing
>
> 'write-behind' is a setting on the bitmap and applies to all
> write-mostly devices, so it can be specified anywhere.
> 'write-mostly' is a setting that applies to a particular device, not
> to a position in the array.  So setting 'write-mostly' on a 'missing'
> device has no useful effect.  When you add a new device to the array
> you will need to set 'write-mostly' on that if you want that feature.

Aha! Good to know.

>   mdadm /dev/md10 --add --write-mostly /dev/nbd0

..
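
(As an aside, and purely as an untested sketch: I gather the per-device
'state' file in sysfs also accepts 'writemostly', so it should be
possible to flag an already-active member without removing it, e.g.

echo writemostly > /sys/block/md10/md/dev-nbd0/state

where 'dev-nbd0' is my assumption about how the member shows up under
/sys/block/md10/md/ -- I haven't actually tried it.)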

>> Then I failed and removed /dev/sdd1, and added /dev/sda:
>>
>> mdadm /dev/md10 --fail /dev/sdd1 --remove /dev/sdd1
>> mdadm /dev/md10 --add /dev/sda
>>
>> I let it rebuild.
>>
>> Then I failed, and removed it:
>>
>> The --fail worked, but the --remove did not.
>>
>> mdadm /dev/md10 --fail /dev/sda --remove /dev/sda
>> mdadm: set /dev/sda faulty in /dev/md10
>> mdadm: hot remove failed for /dev/sda: Device or resource busy
>
> That is expected.  Marking a device as 'failed' does not immediately
> disconnect it from the array.  You have to wait for any in-flight IO
> requests to complete.

Aha! Got it.
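
For the archives, then, the obvious workaround is to split it into two
invocations and give the in-flight IO a moment to drain before the
remove, something like this (the one-second pause is just a guess at
"long enough"):

mdadm /dev/md10 --fail /dev/sda
sleep 1   # let any in-flight IO against the failed device complete
mdadm /dev/md10 --remove /dev/sda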

>> OK. Better, but weird.
>> Since I'm using bitmaps, I would expect --re-add to allow the rebuild
>> to pick up where it left off. It was 78% done.
>
> Nope.
> With v0.90 metadata, a spare device is not marked as being part of the
> array until it is fully recovered.  So if you interrupt a recovery
> there is no record of how far it got.
> With v1.0 metadata we do record how far the recovery has progressed
> and it can restart.  However, I don't think that helps if you fail a
> device - only if you stop the array and later restart it.
>
> The bitmap is really about 'resync', not 'recovery'.

OK, so task 1: switch to 1.0 (1.1, 1.2) metadata. That's going to
happen as soon as my raid10,f2 'check' is complete.
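
Concretely, the plan is to re-create the array with 1.0 metadata and
then add the nbd leg back with write-mostly set on it, roughly like
this (same device names as above, and untested until the check
finishes):

mdadm --create /dev/md10 --metadata=1.0 --level=1 --raid-devices=2 \
      --bitmap=internal --write-behind=256 /dev/sda missing
mdadm /dev/md10 --add --write-mostly /dev/nbd0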

However, it raises a question: bitmaps are about 'resync' not
'recovery'?  How do they differ?

>> Question 1:
>> I'm using a bitmap. Why does the rebuild start completely over?
>
> Because the bitmap isn't used to guide a rebuild, only a resync.
>
> The effect of --re-add is to make md do a resync rather than a rebuild
> if the device was previously a fully active member of the array.

Aha!  This explains a question I raised in another email. What
happened there is that a previously fully active member of the raid
got added, somehow, as a spare, via --incremental, and that's when the
array decided it needed a full rebuild. How did the device end up
being treated as a spare instead of as a previously fully active
member?
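
(In other words, what I expected --incremental to end up doing in that
case was the moral equivalent of a plain re-add, i.e. something like

mdadm /dev/md10 --re-add /dev/sdd1

rather than dropping the device in as a spare.)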

>> 4% into the rebuild, this is what --examine-bitmap looks like for both
>> components:
>>
>>         Filename : /dev/sda
>>            Magic : 6d746962
>>          Version : 4
>>             UUID : 542a0986:dd465da6:b224af07:ed28e4e5
>>           Events : 500
>>   Events Cleared : 496
>>            State : OK
>>        Chunksize : 256 KB
>>           Daemon : 5s flush period
>>       Write Mode : Allow write behind, max 256
>>        Sync Size : 78123968 (74.50 GiB 80.00 GB)
>>           Bitmap : 305172 bits (chunks), 305172 dirty (100.0%)
>>
>> turnip:~ # mdadm --examine-bitmap /dev/nbd0
>>         Filename : /dev/nbd0
>>            Magic : 6d746962
>>          Version : 4
>>             UUID : 542a0986:dd465da6:b224af07:ed28e4e5
>>           Events : 524
>>   Events Cleared : 496
>>            State : OK
>>        Chunksize : 256 KB
>>           Daemon : 5s flush period
>>       Write Mode : Allow write behind, max 256
>>        Sync Size : 78123968 (74.50 GiB 80.00 GB)
>>           Bitmap : 305172 bits (chunks), 0 dirty (0.0%)
>>
>>
>> No matter how long I wait, until it is rebuilt, the bitmap on /dev/sda
>> is always 100% dirty.
>> If I --fail, --remove (twice) /dev/sda, and I re-add /dev/sdd1, it
>> clearly uses the bitmap and re-syncs in under 1 second.
>
> Yes, there is a bug here.
> When an array recovers onto a hot spare it doesn't copy the bitmap
> across.  That will only happen lazily as bits are updated.
> I'm surprised I hadn't noticed that before, so there might be more to
> this than I'm seeing at the moment.  But I definitely cannot find
> code to copy the bitmap across.  I'll have to have a think about
> that.

ok.

>> Question 2: mdadm --detail and cat /proc/mdstat do not agree:
>>
>> NOTE: mdadm --detail says the rebuild status is 0% complete, but cat
>> /proc/mdstat shows it as 7%.
>> A bit later, I check again and they both agree - 14%.
>> Below, from when the rebuild was 7% according to /proc/mdstat
>
> I cannot explain this except to wonder if 7% of the recovery
> completed between running "mdadm -D" and "cat /proc/mdstat".
>
> The number reported by "mdadm -D" is obtained by reading /proc/mdstat
> and applying "atoi()" to the string that ends with a '%'.
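
Entirely possible. Next time I'll grab the two back to back to rule out
a simple race, e.g. something like:

cat /proc/mdstat; mdadm --detail /dev/md10 | grep -i 'rebuild status'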

OK. As I see it, there are three issues here:

1. somehow a previously fully-active member got re-added (via
--incremental) as a spare instead of simply being re-added, forcing a
full rebuild.

2. new-member bitmap weirdness: the bitmap isn't copied over to a
newly recovered member, so it reads as 100% dirty until the rebuild
finishes.

3. the unexplained difference in rebuild progress between mdadm
--detail and cat /proc/mdstat.

I have a few more questions and observations, but I'll save those for
another email.

Thanks for your response(s)!

-- 
Jon