On Sun, 11 Dec 2011 09:03:14 +0200 Anssi Hannula <anssi.hannula@xxxxxx> wrote:

> Hi!
>
> After I rebooted during a raid6 rebuild, the rebuild didn't start again.
> Instead, there is a flood of "RAID conf printout"s that seemingly happen
> on array activity.
>
> All the devices show up properly in --detail and two devices are marked
> as "spare rebuilding", and I can access the contents of the array just
> fine, but the rebuild doesn't actually start. Is this a bug or am I
> missing something? :)
>
> I was initially on 2.6.38.8, but also tried 3.1.4 which seems to have
> the same issue. mdadm is 3.1.5.
>
> I'm not using start_ro and writing to the array doesn't trigger a
> rebuild either.
>
> Attached are --examine outputs before assembly, kernel log output on
> assembly, /proc/mdstat and --detail after assembly (on 3.1.4).
>

Thank you for the very detailed problem report.

Unfortunately it is a complete mystery to me what is happening.

The repeated "RAID conf printout" messages are almost certainly coming from
the end of raid5_remove_disk. It is being called from remove_and_add_spares
for each of the two devices that are being rebuilt. raid5_remove_disk
declines to remove them because it can keep rebuilding them.

remove_and_add_spares then counts them and notes there are 2.
md_check_recovery notes that this is > 0, so it should create a thread to
run md_do_sync.

md_do_sync should then print out a message like

   md: recovery of RAID array md0

but it doesn't. So something went wrong.

There are three reasons that md_do_sync might not print a message:

1/ MD_RECOVERY_DONE is set. As only md_do_sync ever sets it, that is
   unlikely, and in any case md_check_recovery clears it.
2/ mddev->ro != 0. It is only ever set to 0, 1, or 2. If it is 1 or 2
   then we would be able to see that in /proc/mdstat as a "(readonly)"
   status. But we don't.
3/ MD_RECOVERY_INTR is set. Again, md_check_recovery clears this. It does
   get set if kthread_should_stop() returns 'true', but that should only
   happen if kthread_stop() was called. That is only called by
   md_unregister_thread and I cannot see any way that could be called.

So. No idea.

Are you compiling these kernels yourself? If so, could you:

 - put a printk at the top of md_do_sync to report the values of
   mddev->recovery and mddev->ro
 - print a message whenever md_unregister_thread is called
 - in md_check_recovery, in the

        if (mddev->ro) {
                /* Only thing we do on a ro array is remove
                 * failed devices.
                 */
                mdk_rdev_t *rdev;

   if statement, print the value of mddev->ro.

Then see which of those printk's fire, and what they tell us.

NeilBrown
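
The three possible reasons listed above can be modelled in user space as a rough sketch, to make it clearer what the suggested printk of mddev->recovery and mddev->ro would reveal. This is not kernel code: the bit values SK_RECOVERY_INTR and SK_RECOVERY_DONE and the function why_no_sync are hypothetical names invented for illustration (the real flag definitions live in the kernel's md headers), and the order of the checks simply follows the numbered list, not necessarily the order inside md_do_sync.

```c
#include <stdio.h>

/* Hypothetical bit masks, for illustration only. The real
 * MD_RECOVERY_* bit numbers are defined in the kernel md headers. */
#define SK_RECOVERY_INTR (1UL << 0)
#define SK_RECOVERY_DONE (1UL << 1)

/* Which of the three listed conditions, if any, would explain a
 * sync thread that exits without printing "md: recovery of ...". */
enum skip_reason { SKIP_NONE, SKIP_DONE, SKIP_READONLY, SKIP_INTR };

/* Assumed model: check the conditions in the order of the list above. */
enum skip_reason why_no_sync(unsigned long recovery, int ro)
{
    if (recovery & SK_RECOVERY_DONE)
        return SKIP_DONE;       /* 1/ DONE flag already set */
    if (ro != 0)
        return SKIP_READONLY;   /* 2/ array is read-only (ro is 1 or 2) */
    if (recovery & SK_RECOVERY_INTR)
        return SKIP_INTR;       /* 3/ thread was asked to stop */
    return SKIP_NONE;           /* none of the three applies */
}

/* Human-readable label for a log line. */
const char *skip_reason_name(enum skip_reason r)
{
    switch (r) {
    case SKIP_DONE:     return "MD_RECOVERY_DONE set";
    case SKIP_READONLY: return "mddev->ro != 0";
    case SKIP_INTR:     return "MD_RECOVERY_INTR set";
    default:            return "no early exit expected";
    }
}
```

Feeding the values reported by the printk into a check like this (with the real bit definitions substituted) would show which of the three cases, if any, matches the observed behaviour.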