Re: Raid5 reshape

On Tuesday June 20, nigel@xxxxxxxxxxxxxx wrote:
> Nigel J. Terry wrote:
> 
> Well, good news and bad news, I'm afraid...
> 
> I would like to be able to tell you that the time calculation now 
> works, but I can't. Here's why: when I rebooted with the newly built 
> kernel, it hit the magic 21 reboots and hence decided to check the 
> array for cleanliness. This normally takes about 5-10 mins, but this 
> time it took several hours, so I went to bed! I suspect it was doing 
> the full reshape or something similar at boot time.
> 

What "magic 21 reboots"??  md has no mechanism to automatically check
the array after N reboots or anything like that.  Or are you thinking
of the 'fsck' that does a full check every so-often?
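
(If it is the fsck, that periodic check comes from ext3's maximum
mount count, not from md; mke2fs typically picks a threshold in the
low tens, which would explain a full check on the 21st mount. A
minimal sketch, assuming the array is /dev/md0 and carries an
ext2/ext3 filesystem, of inspecting and tuning that threshold with
tune2fs:

    # Show the current mount count and the threshold that triggers
    # a full fsck at boot
    tune2fs -l /dev/md0 | grep -i 'mount count'

    # Raise the threshold so the check fires less often; -c -1
    # disables count-based checking entirely
    tune2fs -c 50 /dev/md0

Raising the threshold is gentler than disabling the check outright,
since the periodic fsck is still worth having.)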


> Now I am not sure that this makes good sense in a normal environment. 
> It could keep a server down for hours or days. I might suggest that if 
> such work were required, the clean check be postponed until the next 
> boot and the reshape allowed to continue in the background.

An fsck cannot tell whether a reshape is happening, but the reshape
should notice the fsck's I/O and slow down to a crawl so the fsck can
complete...
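
(That throttling is governed by the md speed limits. A hedged sketch
of where to look while a reshape is running, assuming the array is
md0:

    # Watch reshape progress and the current rebuild speed
    cat /proc/mdstat

    # Floor that md guarantees even under competing I/O, in KB/sec
    cat /proc/sys/dev/raid/speed_limit_min

    # Ceiling used when the disks are otherwise idle, in KB/sec
    cat /proc/sys/dev/raid/speed_limit_max

These two knobs correspond to the "minimum _guaranteed_" and "maximum
available idle IO bandwidth" figures in the dmesg output quoted
below.)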

> 
> Anyway, the good news is that this morning all is well: the array is 
> clean and has grown, as can be seen below. However, if you look further 
> down you will see the section from dmesg which still shows RIP errors, 
> so I guess there is still something wrong, even though it looks like it 
> is working. Let me know if I can provide any more information.
> 
> Once again, many thanks. All I need to do now is grow the ext3 filesystem...
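
(On that last step, a minimal sketch, assuming the ext3 filesystem
sits directly on /dev/md0 with nothing like LVM in between, and that
it can be unmounted. Once the reshape has finished:

    # resize2fs insists on a clean filesystem first
    e2fsck -f /dev/md0

    # With no size argument, grow to fill the enlarged device
    resize2fs /dev/md0

Depending on the kernel and e2fsprogs versions, a mounted ext3
filesystem can sometimes be grown online with ext2online instead, but
the offline path above is the conservative option.)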
.....

> ...ok start reshape thread
> md: syncing RAID array md0
> md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
> md: using maximum available idle IO bandwidth (but not more than 200000 
> KB/sec) for reconstruction.
> md: using 128k window, over a total of 245111552 blocks.
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <0000000000000000>{stext+2145382632}
> PGD 7c3f9067 PUD 7cb9e067 PMD 0
....
> Process md0_reshape (pid: 1432, threadinfo ffff81007aa42000, task 
> ffff810037f497b0)
> Stack: ffffffff803dce42 0000000000000000 000000001d383600 0000000000000000
>        0000000000000000 0000000000000000 0000000000000000 0000000000000000
>        0000000000000000 0000000000000000
> Call Trace: <ffffffff803dce42>{md_do_sync+1307} 
> <ffffffff802640c0>{thread_return+0}
>        <ffffffff8026411e>{thread_return+94} 
> <ffffffff8029925d>{keventd_create_kthread+0}
>        <ffffffff803dd3d9>{md_thread+248} 

That looks very much like the bug that I already sent you a patch for!
Are you sure that the new kernel still had this patch?

I'm a bit confused by this....
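
(One way to rule out the obvious, sketched on the assumption that the
source tree the kernel was built from is still around; "fix.patch"
stands in for the patch in question and /usr/src/linux for wherever
that tree lives:

    # Confirm which build is actually running; -v includes the
    # build timestamp
    uname -r
    uname -v

    # A dry run reports "previously applied" if the patch is
    # already in the tree
    cd /usr/src/linux && patch -p1 --dry-run < fix.patch

If the running kernel's build timestamp predates applying the patch,
the fix never made it into the installed image.)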

NeilBrown
