On Wed, Apr 14, 2010 at 6:22 PM, Neil Brown <neilb@xxxxxxx> wrote:
> On Wed, 14 Apr 2010 17:51:11 -0700
> Justin Maggard <jmaggard10@xxxxxxxxx> wrote:
>
>> On Fri, Apr 9, 2010 at 7:01 PM, Michael Evans <mjevans1983@xxxxxxxxx> wrote:
>> > On Fri, Apr 9, 2010 at 6:46 PM, Justin Maggard <jmaggard10@xxxxxxxxx> wrote:
>> >> On Fri, Apr 9, 2010 at 6:41 PM, Michael Evans <mjevans1983@xxxxxxxxx> wrote:
>> >>> On Fri, Apr 9, 2010 at 6:28 PM, Justin Maggard <jmaggard10@xxxxxxxxx> wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> I've got a system with two RAID5 arrays that share some physical
>> >>>> devices, combined using LVM.  Oddly, when I "echo repair >
>> >>>> /sys/block/md0/md/sync_action", once it finishes, it automatically
>> >>>> starts a repair on md1 as well, even though I haven't requested it.
>> >>>> Also, if I try to stop it using "echo idle >
>> >>>> /sys/block/md0/md/sync_action", a repair starts on md1 within a few
>> >>>> seconds.  If I stop that md1 repair immediately, sometimes it will
>> >>>> respawn and start the repair on md1 again.  What should I be
>> >>>> expecting here?  If I start a repair on one array, is it supposed to
>> >>>> automatically go through and do it on all arrays sharing that
>> >>>> personality?
>> >>>>
>> >>>> Thanks!
>> >>>> -Justin
>> >>>>
>> >>>
>> >>> Is md1 degraded with an active spare?  It might be delaying resync on
>> >>> it until the other devices are idle.
>> >>
>> >> No, both arrays are redundant.  I'm just trying to do scrubbing
>> >> (repair) on md0; no resync is going on anywhere.
>> >>
>> >> -Justin
>> >>
>> >
>> > First: reply to all.
>> >
>> > Second, if you insist that things are not as I suspect, please post
>> > the output of:
>> >
>> > cat /proc/mdstat
>> >
>> > mdadm -Dvvs
>> >
>> > mdadm -Evvs
>> >
>>
>> I insist it's something different. :)  Just ran into it again on
>> another system.  Here's the requested output:
>
> Thanks.  Very thorough!
>
>> Apr 14 17:32:23 JMAGGARD kernel: md: requested-resync of RAID array md2
>> Apr 14 17:32:23 JMAGGARD kernel: md: minimum _guaranteed_ speed: 1000
>> KB/sec/disk.
>> Apr 14 17:32:23 JMAGGARD kernel: md: using maximum available idle IO
>> bandwidth (but not more than 200000 KB/sec) for requested-resync.
>> Apr 14 17:32:23 JMAGGARD kernel: md: using 128k window, over a total
>> of 972041296 blocks.
>> Apr 14 17:32:51 JMAGGARD kernel: md: md_do_sync() got signal ... exiting
>> Apr 14 17:33:35 JMAGGARD kernel: md: requested-resync of RAID array md3
>
> So we see the requested-resync (repair) of md2 started as you requested,
> then finished at 17:32:51 when you wrote 'idle' to 'sync_action'.
>
> Then 44 seconds later a similar repair started on md3.
> 44 seconds is too long for it to be a direct consequence of the md2 repair
> stopping.  Something *must* have written to md3/md/sync_action.  But what?
>
> Maybe you have "mdadm --monitor" running, and it notices when the repair
> on one array finishes and has been told to run a script (--program, or
> PROGRAM in mdadm.conf) which then starts a repair on the next array???
>
> Seems a bit far-fetched, but I'm quite confident that some program must be
> writing to md3/md/sync_action while you're not watching.
>
> NeilBrown

Well, this is embarrassing.  You're exactly right. :)  It turns out it
was a bug in the script run by mdadm --monitor.  Thanks for the insight!

-Justin
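
For illustration, here is a minimal sketch of the kind of setup Neil
describes: mdadm --monitor pointed at an event-handler script via the
PROGRAM keyword in mdadm.conf, where a buggy handler writes to
sync_action on another array.  The handler path, the RebuildFinished
case, and the hard-coded md3 target are hypothetical (this is not
Justin's actual script); mdadm invokes the program with the event name
and the affected md device as its arguments.

  # /etc/mdadm.conf excerpt -- run a handler for every monitor event
  PROGRAM /usr/local/sbin/md-event-handler

  # /usr/local/sbin/md-event-handler (hypothetical handler)
  #!/bin/sh
  # mdadm --monitor calls this as: <event> <md-device> [<component-device>]
  event="$1"
  array="$2"

  case "$event" in
      RebuildFinished)
          # Chain a scrub onto another array once this one goes idle.
          # A bug here (say, an unconditional, hard-coded target) would
          # write to md3/md/sync_action with nobody watching, exactly
          # the symptom seen in the kernel log above.
          echo repair > /sys/block/md3/md/sync_action
          ;;
  esac

A handler like this runs on every monitor event, so anything it does to
sync_action will look spontaneous from the admin's point of view.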