On 03/29/2010 08:46 PM, Dan Williams wrote:
> I think the disconnect in the imsm case is that the container to
> DOMAIN relationship is N:1, not 1:1.  The mdadm notion of an
> imsm-container correlates directly with a 'family' in the imsm
> metadata.  The rules of a family are:
>
> 1/ All family members must be a member of all defined volumes.  For
> example, with a 4-drive container you could not simultaneously have a
> 4-drive (sd[abcd]) raid10 and a 2-drive (sd[ab]) raid1 volume, because
> any volume would need to incorporate all 4 disks.  Also, per the rules,
> if you create two raid1 volumes sd[ab] and sd[cd], those would show up
> as two containers.
>
> 2/ A spare drive does not belong to any particular family
> ('family_number' is undefined for a spare).  The Windows driver will
> automatically use a spare to fix any degraded family in the system.
> In the mdadm/mdmon case, since we break families into containers, we
> need a mechanism to migrate spare devices between containers because
> they are equally valid hot spare candidates for any imsm container in
> the system.

This explains the weird behavior I got when trying to create arrays on
my IMSM box via the BIOS.  Thanks for the clear explanation of family
delineation.

> This begs the question, why not change the definition of an imsm
> container to incorporate anything with imsm metadata?  This definitely
> would make spare management easier.  This was an early design decision
> and had the nice side effect that it lined up naturally with the
> failure and rebuild boundaries of a family.  I could give it more
> thought, but right now I believe there is a lot riding on this 1:1
> container-to-family relationship, and I would rather not go there.

I'm fine with the container being family based and not domain based.  I
just didn't realize that distinction existed.  It's all cleared up now ;-)

>> However, that just means (to me anyway) that I would treat all of the
>> sata ports as one domain with multiple container arrays in that domain,
>> just like we can have multiple native md arrays in a domain.  If a disk
>> dies and we hot plug a new one, then mdadm would look for the degraded
>> container present in the domain and add the spare to it.  It would then
>> be up to mdmon to determine what logical volumes are currently degraded
>> and slice up the new drive to work as spares for those degraded logical
>> volumes.  Does this sound correct to you, and can mdmon do that already
>> or will this need to be added?
>
> This sounds correct, and no, mdmon cannot do this today.  The current
> discussions we (Marcin and I) had with Neil offlist were about extending
> mdadm --monitor to handle spare migration for containers, since it
> already handles spare migration for native md arrays.  It will need
> some mdmon coordination since mdmon is the only agent that can
> disambiguate a spare from a stale device at any given point in time.

So we'll need to coordinate on this aspect of things then.  I'll keep
you updated as I get started implementing this, if you want to think
about how you would like to handle this interaction between mdadm and
mdmon.
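(Purely as an illustration of those family rules for anyone following
along; the device names are made up and this is just ordinary existing
mdadm usage, not part of the new work proposed below:)

  # One 4-drive container == one family: every volume must use all 4 disks
  mdadm -C /dev/md/imsm0 -e imsm -n 4 /dev/sd[abcd]
  mdadm -C /dev/md/vol0 -l 10 -n 4 /dev/md/imsm0   # ok, spans all members
  # a 2-drive raid1 inside that same container would violate rule 1/ above

  # Two 2-drive raid1s instead show up as two separate containers/families
  mdadm -C /dev/md/imsm1 -e imsm -n 2 /dev/sd[ab]
  mdadm -C /dev/md/vol1 -l 1 -n 2 /dev/md/imsm1
  mdadm -C /dev/md/imsm2 -e imsm -n 2 /dev/sd[cd]
  mdadm -C /dev/md/vol2 -l 1 -n 2 /dev/md/imsm2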
As far as I can tell, we've reached a fairly decent consensus on things.
But, just to be clear, I'll reiterate that consensus here:

Add a new line type, DOMAIN, with the following options (a rough sketch
of what such a line might look like is included after this outline):

  path=      Must be specified at least once for any domain action other
             than none and incremental, and must be something other than
             a global match for any action other than none and
             incremental.

  metadata=  Specifies the metadata type possible for this domain as one
             of imsm/ddf/md.  For the imsm and ddf types we will verify
             that the path portions of the domain do not violate possible
             platform limitations.

  action=    One of none, incremental, readd, safe_use, force_use.  The
             action is specific to a hotplug event when a degraded array
             exists in the domain, and can possibly have slightly
             different meanings depending on whether the path specifies a
             whole disk device or specific partitions on a range of
             devices.  There is the possibility of adding more options,
             or a new option name, for the case of adding a hotplugged
             drive to a domain where no arrays are degraded, in which
             case issues such as boot sectors, partition tables, hot
             spare versus grow, etc. must be addressed.

Modify the udev rules files to cover the following scenarios (a sketch of
the kind of rule involved is also included after this outline).  It's
unfortunate that we have to split things up like this, but in order to
deal with either bare drives or drives that have things like lvm data on
them when we are using force_use, we must trigger on *all* drive hotplug
events, we must trigger early, and we must override other subsystems'
possible hotplug actions; otherwise the force_use option will be a no-op.

1) Plugging in a device that already has md raid metadata present:
   a) if the device has metadata corresponding to one of our arrays,
      attempt a normal incremental add
   b) if the device has metadata corresponding to one of our arrays, the
      normal add failed, and one of the options readd, safe_use, or
      force_use is present in the mdadm.conf file, reattempt the add
      using readd
   c) if the device has metadata corresponding to one of our arrays, the
      readd failed, and the option safe_use or force_use is present, do a
      regular add of the device to the array (possibly with a preemptive
      zero-superblock on the device we are adding).  This should never
      fail.
   d) if the device has metadata that does not correspond to any array in
      the system, there is a degraded array, and the option force_use is
      present, then quite possibly repartition the device so its
      partitions match the degraded devices, zero any superblocks, and
      add the device to the arrays.  BIG FAT WARNING: the force_use
      option will cause you to lose data if you plug in an array disk
      from another machine while this machine has degraded arrays.

2) Plugging in a device that doesn't already have md raid metadata
   present but is part of an md domain:
   a) if the device is bare, the option safe_use is present, and we have
      degraded arrays, partition the device (if needed) and then add the
      partitions to the degraded arrays
   b) if the device is not bare, the option force_use is present, and we
      have degraded arrays, (re)partition the device (if needed) and then
      add the partitions to the degraded arrays.  BIG FAT WARNING: if you
      enable this mode and you hotplug, say, an LVM volume into your
      domain while you have a degraded array, kiss your LVM volume
      goodbye.

Modify the udev rules files to deal with device removal.  Specifically,
we need to watch for the removal of devices that are part of raid arrays;
if they weren't failed when they were removed, fail them and then remove
them from the array.  This is necessary for readd to work.  It also
releases our hold on the scsi device so it can be fully released and the
new device can be added back using the same device name.
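To make the DOMAIN piece above a little more concrete, here is a rough
sketch of what the config might look like.  Nothing here exists yet, so
the exact keyword spellings, path globs, and device paths are all just
placeholders based on the proposal above:

  # mdadm.conf sketch (proposed syntax, not implemented yet)
  # Disks plugged into these controller paths form one imsm domain; a
  # hotplugged disk may be partitioned and used to fix degraded arrays.
  DOMAIN path=pci-0000:00:1f.2-scsi-* metadata=imsm action=safe_use

  # A native md domain that only ever does plain incremental assembly.
  DOMAIN path=pci-0000:03:00.0-sas-* metadata=md action=incremental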
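And the udev side would be something along these lines.  Again, just a
sketch (the rule file name and match details are placeholders), but it
shows the general idea of feeding every whole-disk hotplug event to
mdadm -I early and letting mdadm make the decisions:

  # e.g. /lib/udev/rules.d/05-md-hotplug.rules (sketch only)
  # run early on every whole-disk add event; mdadm -I consults the DOMAIN
  # lines to decide between doing nothing, an incremental add, a readd,
  # or (re)partitioning the disk and using it as a spare
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", \
      RUN+="/sbin/mdadm -I $env{DEVNAME}"

  # device removal would hang off ACTION=="remove" in a similar way so
  # we can fail and remove the departing device from its arrays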
Modify mdadm -I mode to read the DOMAIN lines from the mdadm.conf file on
hotplug events and then modify the -I behavior to suit the situation.
The majority of the hotplug changes mentioned above will actually be
implemented as part of mdadm -I; we will simply add a few rules to call
mdadm -I in a few new situations, then allow mdadm -I (which has
unlimited smarts, whereas udev rules get very convoluted very quickly if
you try to make them smart) to actually make the decisions and do the
right thing.  This means that we might effectively end up calling mdadm
-I on every disk hotplug event whether there is md metadata or not, but
only doing special things when the various conditions above are met.

Modify mdadm and the spare-group concept of ARRAY lines to coordinate
spare-group assignments and DOMAIN assignments.  We need to know what to
do in the event of a conflict between the two.  My guess is that a
conflict is unlikely, but in the end I think we need to phase out
spare-group entirely in favor of domain.  Since we can't have a conflict
without someone adding DOMAIN lines to the config file, I would suggest
that the domain assignments override the spare-group assignments and
that we complain about the conflict.  That way, even though the user
obviously intended something specific with spare-group, he also must
have intended something specific with the domain assignments, and as the
domain keyword is the newest and latest thing, we honor the latest
wishes and warn about it in case they mis-entered something.

Modify mdadm/mdmon to enable spare migration between imsm containers in
a domain.  Retain mdadm's ability to move hot spares between native
arrays, but base it on domain now instead of spare-group.  If the config
file has spare-group assignments but no domain assignments, create
internal domain entries that mimic the spare-group layout so that the
core spare movement code can be modified to only be concerned with
domain entries.

I think that covers it.  Do we have a consensus on the general work?
Your thoughts, Neil?

-- 
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband