On 04/28/2010 09:01 PM, Neil Brown wrote:
> On Wed, 28 Apr 2010 17:05:58 -0400
> Doug Ledford <dledford@xxxxxxxxxx> wrote:
>
>> On 04/28/2010 02:34 PM, Labun, Marcin wrote:
>>>>> Going further, this means that a new disk can potentially be grabbed
>>>>> by more than one container (because of shared paths).
>>>>> For example:
>>>>> DOMAIN1: path=a path=b path=c
>>>>> DOMAIN2: path=a path=d
>>>>> DOMAIN3: path=d path=c
>>>>> In this example disks from path c can appear in DOMAIN 1 and DOMAIN 3,
>>>>> but not in DOMAIN 2.
>>>>
>>>> What exactly is the use case for overlapping paths in different
>>>> domains?
>>>
>>> OK, makes sense.
>>> But if they overlap, will the config functions assign paths as requested
>>> by the configuration file, or treat it as a misconfiguration?
>>
>> For now it merely means that the first match found is the only one that
>> will ever get used.  I'm not entirely sure how feasible it is to detect
>> matching paths unless we are just talking about identical strings in the
>> path= statement.  But since the path= statement is passed to fnmatch(),
>> which treats it as a file glob, it would be possible to construct two
>> path statements that aren't identical strings but still match the same
>> set of files.  I don't think we can reasonably detect this situation, so
>> it may be that the answer is "the first match found is used" and have
>> that be the official stance.
>
> I think we do need an "official stance" here.
> glob is good for lots of things, but it is hard to say "everything except".
> The best way to do that is to have a clear ordering with more general globs
> later in the order.
> path=abcd action=foo
> path=abc* action=bar
> path=* action=baz
>
> So the last line doesn't really mean "do baz on everything" but rather
> "do baz on everything else".
>
> You could impose ordering explicitly with a priority number or a
> "this domain takes precedence over that domain" tag, but I suspect
> simple ordering in the config file is easiest and so best.
>
> An important question to ask here though is whether people will want to
> generate the "domain" lines automatically and if so, how we can make it
> hard for people to get that wrong.
> Inserting a line in the middle of a file is probably more of a challenge
> than inserting a line with a specific priority or depends-on tag.
>
> So before we get too much further down this path, I think it would be good
> to have some concrete scenarios about how this functionality will actually
> be put into effect.  I'd love to just expect people to always edit
> mdadm.conf to meet their specific needs, but experience shows that is
> naive - people will write scripts based on imperfect understanding, then
> share those scripts with others....

OK, so here are some scenarios that I've been working from that show how I
envision this being used:

1) IMSM arrays: a single domain entry with a path= that specifies all (or
some) ports on the controller in question and encompasses all the containers
on that controller.  The action would be spare or grow, and which container
we add any new drives to would depend on the various containers' conditions
and types (a rough sketch of this follows the list).

2) IMSM arrays + native arrays: similar to the above, but split the ports
between IMSM use and native use.  No overlapping paths; some paths go to one,
some to the other.

3) native arrays: one or more domains, no overlapping paths, actions
dependent on the domain.
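For scenario 1, I would expect the config to be about as simple as a single
line, something roughly like this (hypothetical -- "blah" is a stand-in for
the controller path, same as in my example below, and the exact keywords are
still up in the air):

DOMAIN path=blah* metadata=imsm action=spare

with mdadm left to pick the container based on which one is degraded or can
be grown.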
As an example of the last type of setup, let me detail how I commonly
configure my machines when I have multiple drives I want in multiple arrays.
Let's assume a machine with 6 drives (like the one I'm using right now).  I
use a raid5 for my / partition, so I can tolerate at most 1 drive failure on
my machine before it's unusable.  So, my standard partitioning method is to
create a 1gig partition on sda and sdb and make those into a raid1 /boot
partition.  Then I do 1gig on all remaining drives and make that into a
raid5 swap partition.  Then I do the remaining space on all drives as a
raid5 root partition.  I don't put more than two drives in the raid1 /boot
array because if I ever lose two drives I can't use the machine anyway, so
more than two in the /boot array is a waste.  So, in my machine's case, here
are the domain entries I would create:

DOMAIN path=blah[123456] action=force-partition table=/etc/mdadm.table program=sfdisk
DOMAIN path=blah[12]-part1 action=force-spare
DOMAIN path=blah[3456]-part1 action=force-grow
DOMAIN path=blah*-part2 action=force-grow

Assuming that blah in the above is the path to my PCI sata controller, the
first entry would tell mdadm that if a bare disk is inserted into slots 1
through 6, then force the disk to have the correct partition table for my
usage (Dan, I think this should clear up the confusion about the partition
action you had in another email, but I'll address it there too... partition
is really only for native array types, IMSM will never use it).  The second
entry says if it's sda or sdb and it's the first partition (so sda1 or sdb1),
then force it to be added as a spare to any arrays in the domain.  Because of
how the arrays_in_domain function works, this will only ever match the raid1
/boot array, so we know for a fact that it will always get added to the raid1
/boot array.  And because that array only exists on sda1 and sdb1 anyway, we
know that if we ever plug a drive into either of those slots, then the array
will already be degraded, and this spare will be used to bring the array back
into good condition.  The third entry says that on the remaining ports, the
first partition is used to grow (if possible; spare if the array is degraded)
any existing array.  This means that my raid5 swap partition will either get
repaired or grown, depending on the situation.  The final entry makes it so
that the second partition on any disk inserted is used to grow (or spare if
degraded) the / array.

One of the things that the current code relies upon is something that we
talked about earlier.  For native array types, we only allow identical
partition tables.  We don't try to do things like add /dev/sdd4 to an array
comprised of entries such as /dev/sdc3.  Finding a suitable partition when
partition tables are not identical is beyond the initial version of this
code.  Because of this requirement, the arrays_in_domain function can narrow
down the arrays that might match a domain based upon partition number.  So if
the current pathname includes part? in its path, the function only returns
arrays with the same part in their path.  That considerably eases the
matching process.

>>> So, do you plan to make changes similar to incremental in assembly to
>>> serve DOMAIN?
>>
>> I had not planned on it, no.  The reason being that assembly isn't used
>> for hotplug.  I guess I could see a use case for this though, in that if
>> you called mdadm -As then maybe we should consult the DOMAIN entries to
>> see if there are free drives inside of a DOMAIN listed as spare or grow,
>> and whether or not we have any degraded arrays while assembling that
>> could use the drives.  Dunno if we want to do that though.
>> However, I think I would prefer to get the incremental side of things
>> working first, then go there.
>>
>>> Should an array be split (not assembled) if the domain paths divide the
>>> array between two separate DOMAINs?
>>
>> I don't think so.  Amongst other things, this would make it possible to
>> render a machine unbootable if you had a typo in a domain path.  I think
>> I would prefer to allow established arrays to assemble regardless of
>> domain path entries.
>>
>>>> I'm happy to rework the code to support it if there's a valid use
>>>> case, but so far my design goal has been to have a path only appear in
>>>> one domain, and to then perform the appropriate action based upon that
>>>> domain.
>>>
>>> What, then, is the purpose of the metadata keyword?
>>
>> Mainly as a hint that a given domain uses a specific type of metadata.

I want to address this in a bit more detail.  One of the conceptual problems
I've been wrestling with in my mind, if not on paper yet, is the problem of
telling a drive that is intended to be wiped out and reused apart from a
drive that is part of your desired working set.  Let's think about my above
example for native arrays, where there are three arrays: a /boot, a swap,
and a / array.  Much of this talk has centered around "what do we do when we
get a hotplug event for a drive and array <blah> is degraded".  That's the
easy case.  The hard case is "what do we do if array <blah> is degraded and
the user shuts the machine down, puts in a new-to-this-machine drive
(possibly with existing md raid superblocks), and then boots the machine
back up and expects us to do the right thing".  For anyone that doesn't have
true hotplug hardware, this is going to be the common case.  If the drive is
installed in the last place in the system and it's the last drive we detect,
then we have a chance of doing the right thing.  But if it's installed to
replace /dev/sda, we are *screwed*.  It will be the first drive we detect.
And we won't know *what* to do with it.  And if it has a superblock on it,
we won't even know that it's not supposed to be here.  We will happily
attempt incremental assembly on this drive, possibly starting arrays that
have never existed on this machine before.

So, I'm actually finding the metadata keyword less useful than possibly
adding a UUID keyword and allowing a domain to be restricted to one or more
UUIDs.  Then if we find an errant UUID in the domain, we know not to
assemble it, and in fact if the force-spare or force-grow keywords are
present we know to wipe it out and use it for our own purposes.  However,
that doesn't solve the whole problem: if the new drive is /dev/sda then we
won't have any other arrays assembled yet, so the second thing we are going
to have to do is defer our use of the drive until a later time.
Specifically, I'm thinking we might have to write a map entry for the drive
into the mapfile, then when we run mdadm -IRs (because all distros do this
after scsi_wait_scan has completed...right?) we can revisit what to do with
the drive.  The other option is to add the drive to the mapfile, then when
mdadm --monitor mode is started have it process the drive, because all of
our arrays should be up and running by the time we start the monitor
process.  Those are the only two solutions I have to this issue at the
moment.  Thoughts welcome.

>>> My initial plan was to create a default configuration for a specific
>>> metadata, where the user specifies actions but no paths, letting the
>>> metadata handler use default ones.
>>> From your description, I can see that the paths are required.
>>
>> Yes.  We already have a default action for all paths: incremental.  This
>> is the same as how things work today without any new support.  And when
>> you combine incremental with the AUTO keyword in mdadm.conf, you can
>> control which devices are auto-assembled on a metadata-by-metadata basis
>> without the use of DOMAINs.
>>
>> The only purpose of a domain then is to specify an action other than
>> incremental for devices plugged into a given domain.
>
> I like this statement.  It is simple and to the point and seems to capture
> the key ideas.
>
> The question is: is it true? :-)

Well, for the initial implementation I would say it's true ;-)  Certainly
all the other things you bring up here make my brain hurt.

> It is suggested that 'domain' is also involved in spare-groups and could
> be used to warn against, or disable, a 'create' or 'add' which violated
> policy.
>
> So maybe:
> The purpose of a domain is to guide:
>  - 'incremental' by specifying actions for hot-plug devices other than
>    the default

Yes.

>  - 'create' and 'add' by identifying configurations that breach policy

We don't really need domains for this.  The only things that have hard
policy requirements are BIOS-based arrays, and that's metadata/platform
specific.  We could simply test for and warn on create/add operations that
violate platform capability without regard to domains.

>  - 'monitor' by providing an alternate way of specifying spare-groups

Although this can be done, it need not be done.  I'm still not entirely
convinced of the value of the spare-group tag on domain lines.

> It is a lot more wordy, but still seems useful.
>
> While 'incremental' would not benefit from overlapping domains (as each
> hotplugged device only wants one action), the other two might.
>
> Suppose I want to configure array A to use only a certain set of drives,
> and array B that can use any drive at all.  Then if we disallow
> overlapping domains, there is no domain that describes the drives that B
> can be made from.
>
> Does that matter?  Is it too hypothetical a situation?

Let's see if we can construct such a situation.  Let's assume that we are
talking about IMSM based arrays.  Let's assume we have a SAS controller with
more than 6 ports available (may or may not be possible, I don't know, but
for the sake of argument we need it).  Let's next assume we have a 3 disk
raid5 on ports 0, 1, and 2, and another 3 disk raid5 on ports 4, 5, and 6.
Let's then assume we only want the first raid5 to be allowed to use ports 0
through 4, and that the second raid5 is allowed to use ports 0 through 7.
To create that config, we create the two following DOMAIN lines:

DOMAIN path=blah[01234] action=grow
DOMAIN path=blah[01234567] action=grow

Now let's assume that we plug a disk into port 3.  What happens?  Currently,
conf_get_domain() will return one, and only one, domain for a given device.
And it doesn't search for the best match (which would be very difficult to
do as we use fnmatch() to test the glob match, meaning the path= statement
is more or less opaque to us; we don't process or evaluate it ourselves, we
just pass it off to fnmatch() and let it tell us if things matched), it just
finds the first match and returns it.  So, right now anyway, we will match
the first domain and the first domain only.
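Just to make the glob behaviour concrete, here's a throwaway check (the
device names are made up, and this obviously isn't the mdadm code itself,
just the raw fnmatch() calls the matching is built on):

#include <fnmatch.h>
#include <stdio.h>

int main(void)
{
        const char *globs[] = { "blah[01234]", "blah[01234567]" };
        const char *devs[]  = { "blah3", "blah7" };
        int d, g;

        for (d = 0; d < 2; d++)
                for (g = 0; g < 2; g++)
                        /* fnmatch() returns 0 on a match */
                        printf("%-16s vs %s -> %s\n", globs[g], devs[d],
                               fnmatch(globs[g], devs[d], 0) == 0 ?
                               "match" : "no match");
        return 0;
}

The disk in port 3 matches both globs, so only config file ordering decides
which DOMAIN it lands in; the disk in port 7 matches only the second glob.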
So for the port 3 case we return that first domain, and later when we call
arrays_in_domain() we pass in our device path plus our matched domain,
search mdstat, and find both raid5 arrays in our requested domain (the
current search returns any array with at least one member in the domain;
maybe that should be any array where all members are in the domain).  Now,
at this point, if one or the other array is degraded, then what to do is
obvious.  However, if both arrays are degraded, or neither array is
degraded, then our choice is not obvious.  I'm having a hard time coming up
with a good answer to that issue.  It's not clear which array we should grow
if both are clean, nor which one takes priority if both are degraded.  We
would have to add a new tag, maybe priority=, to the ARRAY lines in order to
make this decision obvious.  Short of that, the proper course of action is
probably to do nothing and let the user sort it out.

Now let's assume that we plug a disk into port 7.  We search and find the
second domain.  Then we call arrays_in_domain() and we get both raid5 arrays
again, because both of them have members in the domain.  Regardless of
anything else, it's clear that this situation did *not* do what you wanted.
It did not specify that array 1 can only be on the first 5 ports, and it did
not specify that array 2 can use all 8 ports.  If we changed the second
domain path to blah[567] then it would work, but I don't think that this
combination of domains and the resulting actions is all that easy to
understand from a user's perspective.  I think that, right now, trying to do
what you are suggesting via domain lines alone is confusing.  Maybe we need
to add something to array lines for this.  Maybe the array line needs an
allowed_path entry that could be used to limit which paths an array will
accept devices from.  But this then assumes we will create an array line for
all arrays (or at least for the ones where we want to limit their paths),
and I'm not sure people will do (or want to do) that.  So, while I can see a
possible scenario that matches your hypothetical, I'm finding that the
domain construct is a very clunky way to try and implement its constraints.

> Here is another interesting question.  Suppose I have two drive chassis,
> each connected to the host by a fibre.  When I create arrays from all
> these drives, I want them to be balanced across the two chassis, both for
> performance reasons and for redundancy reasons.
> Is there any way we can tell mdadm about this, possibly through 'domains'?

This is actually the first thing that makes me see the use of spare-group on
a domain line.  We could construct two different domains, one for each
chassis, but with the same spare-group tag.  This would imply that both
domains are available as spares to the same arrays, but it allows us to then
add a policy to mdadm for how to select spares from domains.  We could add a
priority tag to the domain lines.  If two domains share the same spare-group
tag and have the same priority, then we could round-robin allocate from the
domains (what you are asking about), but if they have different priorities
then we could allocate solely from the higher (or lower, implementation
defined) priority domain until there is nothing left to allocate from it,
and then switch to the other domain.  I could actually also see adding a
write_mostly flag to an entire domain in case the chassis that domain
represents is remote via WAN.
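Something along these lines is what I have in mind (completely hypothetical
syntax -- neither spare-group nor priority exists on DOMAIN lines today, and
the chassis globs are just stand-ins):

DOMAIN path=chassis-a* spare-group=pool priority=1 action=spare
DOMAIN path=chassis-b* spare-group=pool priority=1 action=spare

Equal priorities would mean balancing spare allocation across the two
chassis; giving one a different priority would mean draining its spares
before ever touching the other.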
> This could be an issue when building a RAID10 (alternating across the
> chassis is best) or when finding a spare for a RAID1 (choosing from the
> 'other' chassis is best).
>
> I don't really want to solve this now, but I do want to be sure that our
> concept of 'domain' is big enough that we will be able to fit that sort of
> thing into it one day.
>
> Maybe a 'domain' is simply a mechanism to add tags to devices, and
> possibly by implication to arrays that contain those devices.
> The mechanism for resolving when multiple domains add conflicting tags to
> the one device would be dependent on the tag.  Maybe first-wins.  Maybe
> all are combined.
>
> So we add an 'action' tag for --incremental, and the first wins (maybe)
> We add a 'sparegroup' tag for --monitor
> We add some other tag for balancing (share=1/2, share=2/2 ???)
>
> I'm not sure how this fits with imposing platform constraints.
> As platform constraints are closely tied to metadata types, it might be OK
> to have metadata-specific tags (imsm=???) and leave the details to the
> metadata handler???

I'm more and more of the mind that we need to leave platform constraints out
of the domain issue and instead just implement proper platform constraint
checks and overrides in the various parts of mdadm that need them,
regardless of domains.

> Dan: help me understand these platform constraints: what is the most
> complex constraint that you can think of that you might want to impose?
>
> NeilBrown

--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband