On Wednesday October 24, dmiller@xxxxxxxxx wrote:
> Current mdadm.conf:
> DEVICE partitions
> ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4
>    UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a auto=part
>
> I still have the problem where, on boot, one drive is not part of the
> array.  Is there a log file I can check to find out WHY a drive is not
> being added?  It's been a while since the reboot, but I did find some
> entries in dmesg - I'm appending both the md lines and the physical
> disk related lines.  The bottom shows one disk not being added (this
> time it was sda) - and the disk that gets skipped on each boot seems
> to be random - there's no consistent failure:

Odd.... but interesting.
Does it sometimes fail to start the array altogether?

> md: md0 stopped.
> md: md0 stopped.
> md: bind<sdc>
> md: bind<sdd>
> md: bind<sdb>
> md: md0: raid array is not clean -- starting background reconstruction
> raid10: raid set md0 active with 3 out of 4 devices
> md: couldn't update array info. -22
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the most surprising line, and hence the one most likely to
convey helpful information.

This message is generated when a process calls "SET_ARRAY_INFO" on an
array that is already running, and the changes implied by the new
"array_info" are not supportable.

The only way I can see this happening is if two copies of "mdadm" are
running at exactly the same time and both are trying to assemble the
same array.  The first calls SET_ARRAY_INFO and assembles the
(partial) array.  The second calls SET_ARRAY_INFO and gets this
error.  Not all devices are included because when one mdadm went to
look at a device, the other had it locked, and so the first just
ignored it.

I just tried that, and sometimes it worked, but sometimes it
assembled with 3 out of 4 devices.  I didn't get the "couldn't update
array info" message, but that doesn't prove I'm wrong.

I cannot imagine how that might be happening (two at once) unless
maybe 'udev' had been configured to do something as soon as devices
were discovered.... seems unlikely.

It might be worth finding out where mdadm is being run in the init
scripts, adding a "-v" flag, and redirecting stdout/stderr to some
log file.  e.g.

   mdadm -As -v > /var/log/mdadm-$$ 2>&1

And see if that leaves something useful in the log file.

BTW, I don't think your problem has anything to do with the fact that
you are using whole partitions.  While it is debatable whether that
is a good idea or not (I like the idea, but Doug doesn't, and I
respect his opinion) I doubt it would contribute to the current
problem.

Your description makes me nearly certain that there is some sort of
race going on (that is the easiest way to explain randomly differing
behaviours).  The race is probably between different pieces of code
'locking' (opening with O_EXCL) the various devices.  Given the above
error message, two different 'mdadm's seems most likely, but an mdadm
and a mount-by-label scan could probably do it too.

NeilBrown
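
A minimal sketch of the SET_ARRAY_INFO path described above, assuming
/dev/md0, root privileges, and the definitions in <linux/raid/md_u.h>
(illustrative only, not mdadm's actual assembly code):

  /* Illustrative probe of the ioctl path described above.  Assumes
   * /dev/md0 exists and that this runs as root.  It reads the running
   * array's array_info and hands it back with a change the running
   * array cannot accept; the kernel rejects that with EINVAL (-22),
   * which is where the quoted dmesg line comes from. */
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/raid/md_u.h>   /* mdu_array_info_t, GET_ARRAY_INFO, SET_ARRAY_INFO */

  int main(void)
  {
      mdu_array_info_t info;
      int fd = open("/dev/md0", O_RDONLY);   /* assumed device node */

      if (fd < 0) {
          perror("open /dev/md0");
          return 1;
      }

      /* Read the array_info of the (already running) array ... */
      if (ioctl(fd, GET_ARRAY_INFO, &info) < 0) {
          perror("GET_ARRAY_INFO");
          close(fd);
          return 1;
      }

      /* ... and push it back with an unsupportable change (here, a
       * different level).  Expect EINVAL, errno 22, and the kernel
       * log line "md: couldn't update array info. -22". */
      info.level = (info.level == 1) ? 5 : 1;
      if (ioctl(fd, SET_ARRAY_INFO, &info) < 0)
          fprintf(stderr, "SET_ARRAY_INFO: %s (errno %d)\n",
                  strerror(errno), errno);

      close(fd);
      return 0;
  }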
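
The O_EXCL 'locking' race itself can be sketched the same way; the
member-device path below is only an example, and the second open
stands in for what a racing second mdadm would attempt:

  /* Illustrative only: the O_EXCL "locking" mentioned above.  mdadm
   * opens member devices with O_EXCL while assembling; a second
   * exclusive open of the same block device fails with EBUSY, so the
   * losing assembler quietly skips that device - hence the randomly
   * missing drive. */
  #include <stdio.h>
  #include <string.h>
  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      const char *dev = "/dev/sda1";           /* substitute a member device */

      int fd1 = open(dev, O_RDONLY | O_EXCL);  /* first assembler claims it */
      if (fd1 < 0) {
          fprintf(stderr, "first open of %s: %s\n", dev, strerror(errno));
          return 1;
      }

      /* Second exclusive open while the first is still held: this is
       * what the losing mdadm sees (typically EBUSY). */
      int fd2 = open(dev, O_RDONLY | O_EXCL);
      if (fd2 < 0)
          printf("second O_EXCL open of %s: %s\n", dev, strerror(errno));
      else
          close(fd2);

      close(fd1);
      return 0;
  }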