Hi all,

I figured it was time to make a firm decision on what "domains" and related things would look like in mdadm. In all the discussions so far I have just been making suggestions and exploring possibilities and wandering around the edges of the issue. But that cannot last forever as there is need for some certainty.

I had a read through Doug's patch set and Przemyslaw's and Anna's work on top of that and there were certain aspects of what I saw that I didn't like. In particular the model of what a 'domain' was seems to keep changing, first growing special cases for partitions, and then growing subsets (which I admit I didn't completely understand). When something grows and changes like that so quickly there is a very real possibility that the final result won't meet the original needs any more.

I think we need to start with something that is *right* - at least as far as it goes. Refinements that are predictable are ok, but structural changes aren't.

So here is my concrete proposal on how these things will work. I have already started implementing it, which shows that I'm fairly committed to this and would need a very strong argument for significant change to happen.

The first step is to forget about domains. We will come back to them later as they are important and useful. But they are not central and we won't be starting there. So forget them. (Forget what? I don't remember anything...)

What we need is a policy framework, for encoding policy about the various automatic actions that mdadm performs. We already have bits of policy like the spare-group tag (which guides automatic spare migration) and the 'auto' mdadm.conf line (which guides automatic assembly). However that is all ad-hoc and as the amount of policy increases, the amount of interaction increases so we need a unifying platform. That is where we need to start.

So point 1 is that we need a policy framework.

Point 2 is that policy revolves primarily around devices (rather than arrays) and to a lesser extent around metadata types. It is devices that are migrated, devices that arrays are built from, devices that are automatically made into spares etc. Metadata types often encode some specific policy in the metadata, so they need some fairly strong role in the policy framework too. Often the metadata type is like a parameter to a policy. "You can incorporate this device in any imsm array".

So Abstraction 1 is a "Policy statement". A policy statement applies to a particular device, possibly in the context of a particular metadata, and asserts that a particular name has a particular value.

   action=spare (ddf1)

might be a policy statement about a device. It says that where ddf1 metadata is involved, the device can be made a hot-spare when it is hot-plugged.

   auto=homehost (0.90)

might be another which says that auto-assembly may use a non-disambiguated name (no trailing _NN) when assembling this device into a metadata=0.90 array providing the homehost information in the metadata matches this host.

A statement might not have any metadata type associated.

   action=ignore

applies irrespective of metadata type.

The policy names that I currently envisage are:

   action= ignore, include, spare, force-spare
      which covers the hotplug actions that --incremental might perform.

   auto= yes, homehost, no
      which covers the functionality currently in the AUTO mdadm.conf line

   domain= arbitrary-string
      This provides the 'domain' isolation functionality. The semantics I have in mind (and the precise details here are fairly important so this cannot be changed lightly) are:

         A device can have a number of domains, possibly from various sources.

         An array can have a number of domains, from the devices plus from spare-group

         A device may be attached to an array if all of the domains of the device are also domains of the array. The array may have extra domains. The device may not.

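As an illustration only, the attach rule above can be sketched in a few lines of Python. This is not mdadm code (mdadm is written in C); the function names `array_domains` and `may_attach` are hypothetical, and domains are assumed to be plain strings held in sets.

```python
def array_domains(member_domains, spare_group=None):
    """Collect an array's domains: the union of its member devices'
    domains, plus the spare-group name if one is configured."""
    domains = set()
    for device_domains in member_domains:
        domains.update(device_domains)
    if spare_group is not None:
        domains.add(spare_group)
    return domains


def may_attach(device_domains, arr_domains):
    """A device may be attached if all of its domains are also domains
    of the array. The array may have extra domains; the device may not."""
    return set(device_domains) <= set(arr_domains)


# Hypothetical example data: two member devices and a spare-group.
arr = array_domains([{"global", "bay1"}, {"global"}], spare_group="groupA")
print(may_attach({"global"}, arr))          # True
print(may_attach({"global", "bay2"}, arr))  # False
```

The check is a plain subset test, so it does not care where a domain came from (a config line, the metadata handler, or spare-group).
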
This requires that if there are overlapping domains, they must properly nest. i.e. the intersection of two domains must be empty, or one of the domains. It might make sense to have a domain 'global' which all devices have, and some other domains which just subsets have.

There is probably room for other policies like whether to start an incrementally assembled degraded array early, or wait until it is not degraded. Maybe some policy for handling "prodigal device" situations where two halves of a mirror both think they are "it" and the other is "not".

By now Doug (hope your back is feeling better) will have noticed that partitions haven't been mentioned yet. So it is time for them.

Point 3: partitions become a new metadata type (or types). If we want mdadm to ensure there is a MBR partition table on a device, then provide a policy statement like

   action=spare (mbr)

so if the device doesn't have recognised metadata, mdadm configures it as a spare of type mbr, getting the table from some compatible pre-existing device.

There is probably room to refine this to get the table from a file like Doug's patches aimed to. That wouldn't be my first preference as it requires extra configuration, but it might be necessary. That would require adding some sort of argument to each policy statement, so they become

   name = value (metadata) other-arguments

I'd rather keep that to a very minimum though.

Note that the above syntax is all abstract syntax. It reflects the internal data structures, but not necessarily the way that policy will be expressed to mdadm. For that we need to start with some concrete syntax for mdadm.conf. So:

Point 4: policy is specified in mdadm.conf by "POLICY" lines (aka policy rules)

A policy line contains match words, assignment words, and metadata words.

   match words are name=value or possibly name==value - haven't decided yet.
   assignment words are name=value (or name:=value ... probably not)
   metadata words are "metadata=foo"

A device matches a policy line if, for each match name that appears, the device matches at least one of the values. So if we have

   POLICY a==1 a==2 b==3 b==4

then for a device to match it must have an 'a' of 1 or 2, and a 'b' of 3 or 4, but it doesn't matter what the device has for 'c'.

One device may match multiple POLICY lines and if it does so, it accumulates all the assigned words. The ordering of policy lines is irrelevant to the end result. For this to work we might need to add a "word!=value" - I hope not, but it wouldn't be a big problem.

If a device matches a policy line then a separate policy statement is created combining each assignment word with each metadata word (if there are any). This list of policy statements is added to the device's policy.

Sometimes policy is very metadata dependent so:

Point 5: policy can be specified by the metadata handler too. If a device is found to have metadata on it, then when that metadata is loaded (->load_super()) it might add some policy statements to the device. If it does they will all be in the context of the relevant metadata type. This will probably include 'domain' assignments to restrict spare migration.

But wait, there's more

Point 6: We probably have platform policy too. I'm not really sure what this will involve, and what if anything needs to be explicit. Maybe just

   platform-policy imsm

in mdadm.conf tells mdadm to query the platform and deduce some policy statements or policy rules.

There is a strong pattern that when a set of devices is partitioned, all the '1' partitions go in one array, all the '2' partitions in another etc. It might be useful to have config-file support for this pattern, so a possible config file line would be:

   partition-policy path=foo domain=bar

which effectively makes multiple policy lines each of which has '-partNN' added to all 'path' values and all 'domain' values. But I'm getting ahead of myself...

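To make the matching rule from Point 4 concrete, here is a minimal Python sketch (illustration only, not mdadm code). It assumes the still-undecided `name==value` form for match words and `name=value` for assignment words, compares match values by exact string equality, and represents a device as a dict of attributes. The names `PolicyStatement`, `parse_policy_line`, `device_matches` and `statements_for` are all hypothetical.

```python
from collections import namedtuple
from itertools import product

# Abstract policy statement: name = value (metadata); metadata may be None.
PolicyStatement = namedtuple("PolicyStatement", "name value metadata")


def parse_policy_line(line):
    """Split a POLICY line into match, assignment and metadata words."""
    words = line.split()
    if not words or words[0] != "POLICY":
        raise ValueError("not a POLICY line: %r" % line)
    matches = {}      # match name -> set of acceptable values
    assignments = []  # (name, value) pairs
    metadatas = []    # metadata type names
    for word in words[1:]:
        if "==" in word:  # assumed match-word syntax
            name, value = word.split("==", 1)
            matches.setdefault(name, set()).add(value)
        else:
            name, value = word.split("=", 1)
            if name == "metadata":
                metadatas.append(value)
            else:
                assignments.append((name, value))
    return matches, assignments, metadatas


def device_matches(device, matches):
    """For each match name, the device must have one of the listed values."""
    return all(device.get(name) in values for name, values in matches.items())


def statements_for(device, policy_lines):
    """Accumulate statements from every matching line: one statement per
    (assignment word, metadata word) pair, or metadata None if none given."""
    result = set()
    for line in policy_lines:
        matches, assignments, metadatas = parse_policy_line(line)
        if device_matches(device, matches):
            for (name, value), md in product(assignments, metadatas or [None]):
                result.add(PolicyStatement(name, value, md))
    return result
```

Because the result is a set that only ever grows, the order of the POLICY lines cannot change the outcome, which is the property the text asks for. A "word!=value" match would need an extra branch in `parse_policy_line` and `device_matches`.
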
The 'match' names that I imagine are:

   path=  which is given a 'glob' pattern to match against the path name from /dev/disk/by-path/
   type=  which is either 'disk' or 'partition'

We could also have

   size=  which uses the standardised disk sizes so it would be easy to say that all 2GB devices only migrate to arrays with 2GB devices in them.

So: given a device we extract a bunch of policy statements from various sources. Now we need to know how to apply those policy statements in different situations.

There are various contexts where we need to review policy.

A/ When considering adding a device to an array. This can happen at hot-plug either because the device looked like a member of the array, or because the device is being added as a new spare.

   The primary policy information here is 'domain'. We extract a list of domains that the device is in which are specific to the metadata of the array (or are not metadata-specific). We also get a list of domains for the array by extracting a similar list for each device and including any spare-group from mdadm.conf.

   Then we check if the set of domains for the device is a subset of the set of domains for the array. If it is (and is non-empty), the addition is allowed. If it isn't then the addition probably isn't allowed, though we might invent some other policy like "no-strict-domains", or assert that domains don't apply when the user explicitly makes a request, or uses --force. Or something.

   This might have some variation depending on whether the 'add this to an array' came from --create or --assemble or --add or --re-add or --incremental or --monitor doing spare migration. My point at the moment isn't to give the entire algorithm but to show how the policy framework would inform that algorithm.

B/ When considering what to do with a device that has been passed to --incremental. For this we need to

   1/ identify an array, and hence a metadata type
   2/ find the 'action' policy for the device with that metadata type.
   3/ if there is more than one, fail
   4/ if the one is 'ignore' do nothing
   5/ if 'A' above says we cannot add this device, then give up
   6/ consider which of 'include', 'spare', 'force-spare' might apply here.....

   If the device has recognisable metadata, which identifies an array, then the array identified in step 1 is just that array. If the device does not have recognisable metadata, then we consider each array in turn (though we might optimise out some easy cases like if all metadatas say 'ignore' then don't bother listing arrays). If multiple arrays all allow the device to be added, we would need to choose the first which is degraded (unless we invented some other policy).

So this is how I want these things to work, and this is what I'm going to be coding. I should have the basic framework in place early next week (assuming no major interruptions) at which point I'll make the code available.

The part of this that I'm least confident of is assigning domains to arrays. Extracting a list of policy statements for each device sounds a bit cumbersome. Maybe if I cache enough bits of it, it will work nicely.

Comments, as always, are most welcome.

Thanks,
NeilBrown