On Thu, 2006-11-09 at 16:32 +1100, Neil Brown wrote:
> On Thursday November 2, dledford@xxxxxxxxxx wrote:
> > If I use mdadm 2.5.4 to create a version 1 superblock raid1 device,
> > it starts a resync.  If I then reboot the computer part way through,
> > when it boots back up, the resync gets cancelled and the array is
> > considered clean.  This is against a 2.6.18.1 kernel.
>
> I cannot reproduce that (I tried exactly 2.6.18.1).
> Do you have kernel logs of the various stages?

No, I don't.  It could, however, be related to the fact that I built
the array, let it sync completely, then rebuilt the array with a new
superblock without running mdadm --zero-superblock on each device first
(I was playing with different options).  After that second build, the
resync was about 66% done when I decided to reboot and try out the
"restart from where it left off" feature of the version 1 superblock.
When I rebooted, to my dismay, the array was marked clean and no sync
was in progress.

So, I used mdadm to fail/remove/add the second device from the array.
To my surprise, it went in clean *again*.  This time I think that was
intentional: I don't believe I had dirtied the array between the fail
and the add, so the generation counts matched, and mdadm or the kernel
decided it was safe to put the device back in the array without a
resync.  (I know a *lot* of our customers are going to question the
safety of that.  The logic may be sound, but it's going to make them
nervous when a fail/remove/add cycle doesn't trigger a resync.  I don't
know if you've documented the exact logic in use when doing that, but
it's no doubt going to fall under scrutiny.)  So, the next time I did a
fail/remove/zero-superblock/add cycle, and that triggered the full
resync that I wanted.

My guess is that because I didn't zero the superblocks between the
initial creation and the recreation with a different --name option (I
think that was the only thing I changed), something in the code path
that detects what *should* be a safe remove/add cycle kicked in and
stopped the resync from resuming after the reboot.

> > If I create a version 1 superblock raid1 array, mdadm -D
> > <constituent device> says that the device is not part of a raid
> > array (and likewise the kernel autorun facility fails to find the
> > device).
>
>   mdadm -D <constituent device>
> is incorrect usage.  You want
>   mdadm -E <constituent device>
> or
>   mdadm -D <assembled array>
>
> in-kernel autorun does not work with version 1 metadata as it does
> not store a 'preferred minor'.

No, but it has a name, so theoretically it could assemble the array and
use the name component without regard to the minor number.  Regardless,
now that FC6 is out the door, I want to make some changes for FC7, and
I'm trying to get the changes in place so we don't use autorun, so it
may very well be a moot point as far as we are concerned.  (Although
the kernel messages saying the devices don't have a superblock might
confuse people; it would be better to print something to the effect of
"autorun doesn't work on version 1 superblocks: skipping" instead of
"<sda1> no superblock found" or whatever it says now.)

> > If I create a version 1 superblock raid1 array, mdadm -E
> > <constituent device> sees the superblock.  If I then run mdadm -E
> > --brief on that same device, it prints out the one-line ARRAY line,
> > but it misprints the UUID as a 10-digit hex number : 8-digit hex
> > number : 8-digit hex number : 6-digit hex number.
>
> Oops, so it does.  Fix below.  Thanks.

Thanks.
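For the archives: the bug groups the 32 hex digits as 10:8:8:6; with
the fix they should come out as four groups of 8, i.e. an ARRAY line
looking something like this (UUID, name, and exact field order are
made up for illustration):

  ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=2
     UUID=8ba7c9d5:69b36458:2cf371f2:29ba4a1d name=firewall:0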
> > It also prints the mdadm device in the ARRAY line as /dev/md/#,
> > whereas mdadm -D --brief prints the device as /dev/md#.
> > Consistency would be nice.
>
> It would be nice .... but how important is it really?
> If you create the same array with --name=fred, then
> -Eb will give /dev/md/fred, while -Db will give /dev/md2.
> Both are right in a sense.
>
> Of course if you say
>   mdadm -Db /dev/md/2
> then you get /dev/md/2.
> You only have /dev/md2 forced on you with
>   mdadm -Ds
> In that case mdadm doesn't really know what name you want to use for
> the md device....  I guess it could scan /dev.

Well, given the clarification above about the difference between -D and
-E, and that no minor information is stored in version 1 superblocks, I
can see where -Eb has no choice but to use the name and omit the minor
information entirely (unless you did something like check whether the
array is currently assembled and, if so, which minor it is currently
on, but that's overkill IMO).  However, what that means to me is that
in future products, if we switch to version 1 superblocks (which I
would like to do), I need to teach the OS to ignore the major/minor of
the device and *only* use the name.  The reason is that if you ever
lose your mdadm.conf file, you can't get consistent major/minor
information back by going to the device, only the name information.
So, ignore major/minor completely, let the md stack use whatever minor
it wants from boot to boot, and rely entirely on the name the device
would get under /dev/md/ for uses in things like lvm2 configs, fstab,
etc.

Now, the only question that leaves for me is some way to tell a
partitioned device from a normal device.  Obviously, this isn't
something that needs to be upstream, but a standardized way of telling
the difference for the purpose of rescue disks and the like will be
necessary for our distribution, I think.  I'm thinking that if the name
ends in _p#, where # is some integer, we lop off the _p# part when
naming the whole device and make a partitioned array with that number
of partitions, named _p1 through _p#.  It would preclude the use of
certain possible names, but I think that should be sufficient.
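To make that concrete, here is the sort of logic I'd expect the rescue
disk to carry; just a sketch, with the array name and member devices
made up:

  # Convention: array name "data_p4" -> partitioned device
  # /dev/md/data with partitions /dev/md/data_p1 .. /dev/md/data_p4.
  name=data_p4
  case "$name" in
  *_p[1-9]*)
          base=${name%_p*}        # "data"
          nparts=${name##*_p}     # "4"
          # --auto=part<N> asks mdadm for a partitionable array
          mdadm --assemble --auto=part$nparts /dev/md/$base \
                /dev/sda1 /dev/sdb1
          ;;
  *)
          mdadm --assemble --auto=md /dev/md/$name /dev/sda1 /dev/sdb1
          ;;
  esac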
> > Does the superblock still not store any information about whether
> > or not the array is a single device or partitionable?  Would be
> > nice if the superblock gave some clue as to that fact so that it
> > could be used to set the auto= param on an mdadm -E --brief line to
> > the right mode.
>
> No, the superblock does not store any information about whether or
> not the array has partitions.  Presumably the partition table does ...
>
> I would recommend that if you want to do everything automatically,
> then assemble all arrays as partitionable.  Then you can use it with
> or without partitions.

I'd prefer not to go this route, just in case a non-partitioned array
ends up containing some bogus data that looks like a partition table
and results in out-of-range partitions.

> > Mdadm assumes that the --name option passed in to create an array
> > means something in particular to the md array name and modifies
> > subsequent mdadm -D --brief and mdadm -E --brief outputs to include
> > the name option minus the hostname.  Aka, if I set the name to
> > firewall:/boot, mdadm -E --brief will then print out the ARRAY line
> > device as /dev/md//boot.  I don't think this is documented anywhere.
> > This also raises the question of how partitionable md devices will
> > be handled in regards to their name component.
>
> The 'Auto Assembly' section talks about this a bit, though not in the
> context of --examine.
>
> The documentation for --auto suggests that in the above case,
> partitions would be
>   /dev/md/boot1
>   /dev/md/boot2
>
> I'm always keen to improve the documentation.  What would you like
> included where?

Well, the main thing I was pointing out is that if the name matches the
filesystem's label on the device (aka /boot for the boot partition, and
therefore firewall:/boot for the md device), mdadm double-prints the
slash.  That wasn't a typo in my comment; that's what mdadm does.

On a separate note, can you let me know what difference the version
1.0, 1.1, and 1.2 superblocks make to the possible size of internal
write-intent bitmaps?  The particular disks I'm testing with are about
320G each, and with an internal bitmap and superblock format 1.0, the
bitmap is limited to 5 pages.  For larger arrays, I could see wanting a
larger internal bitmap.  Do any of the other superblock options allow
for that?

Another thing I'm thinking about doing is defaulting to an mdadm setup
that keeps the data from being easily accessible when the array isn't
running; aka, a superblock at the beginning of the array instead of at
the end.  While being able to mount a constituent device when the array
isn't running may be convenient, it presents a special problem for
consistency guarantees, since you can mount the device rw without the
md stack ever knowing about it, so it won't know to resync the array
afterward.  Now imagine you have a partitionable array, you repartition
a constituent device by mistake, and then the array gets reassembled.
You wouldn't know for sure whether you would get the old or the new
partition table.  Talk about a nightmare.  I wouldn't worry so much
about this if it hadn't actually happened to me in testing: the initrd
failed to assemble the right array, and as a result the lvm stack found
its metadata on both constituent devices and then randomly chose which
one to use to start the lvm logical volumes.  If it hadn't been for the
init sequence dropping into maintenance mode, my system would have been
royally screwed.  Bad, bad, bad, bad, bad.

If I do things the other way around, with the superblock at the front
shifting all the data down the device, then at least if the array fails
to start, nothing else will be found.  A rescue CD could always start
the array in degraded mode to let you get to your data, and once you
have started in degraded mode, future reboots will always know which
device is up to date and which device was left out, and will resync
appropriately.

The only real problem I see with this approach is getting grub/lilo to
work with the /boot partition.  They would probably have to be taught
how to calculate the proper offset into the array depending on whether
it is partitioned or not.  For example, for a non-partitioned array,
you would have to teach grub that if your boot data is on, say,
(hd0,0), and (hd0,0) is part of a raid array with certain superblock
types (it would probably have to read /proc/mdstat to know), then the
start of (hd0,0) is not the start of the filesystem.  Instead,

  partition size in blocks - whole md device size in blocks
    = offset into the partition to the start of the md device

and consequently to the ext filesystem that /boot is comprised of.
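To put some (entirely made-up) numbers on that, the arithmetic grub
would need is roughly:

  # Units are 512-byte sectors; note /proc/mdstat reports 1K blocks,
  # so the array size is doubled.  Device name and sizes invented.
  PART_SECTORS=$(blockdev --getsz /dev/sda1)   # say, 417564
  MD_SECTORS=$((208704 * 2))                   # 208704 blocks in /proc/mdstat
  OFFSET=$((PART_SECTORS - MD_SECTORS))        # 156 sectors in this example
  echo "ext filesystem starts $OFFSET sectors into (hd0,0)"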
If it is partitioned, then you could teach grub the notion of
(hd0,0,0), aka chained partition tables, where you use the same offset
calculation above to get to the chained partition table, then read that
partition table to get the offset to the filesystem.  I don't think it
would be too difficult to add to grub, but it would have to be added.
This does, however, point out that the md stack's decision to use a
geometry on its devices that is totally different from the real
constituent device geometry means that grub would have to perform
conversions on that chained partition table to get from md offsets to
real device offsets.  That may not matter much in the end, but it will
have to be done.

The difference in geometry also precludes doing a whole-device md array
with the superblock at the end and the partition table where the normal
device partition table would be.  Although that sort of setup is risky
in terms of failure to assemble, as I pointed out above, it does have
its appeal for certain situations, like multipath, or the ability to
have a partitioned raid1 device with /boot in the array without needing
to modify grub, especially on machines that don't have built-in SATA
raid that dm-raid could make use of.

-- 
Doug Ledford <dledford@xxxxxxxxxx>  GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband