Jeff, I have a few questions regarding the rules syntax and how they apply. I
think this is different in spirit from the discussion Dan has started, so I am
keeping it separate. See questions inline.

----- Original Message -----
> One of the things holding up our data classification efforts (which include
> tiering but also other stuff as well) has been the extension of the same
> conceptual model from the I/O path to the configuration subsystem and
> ultimately to the user experience. How does an administrator define a
> tiering policy without tearing their hair out? How does s/he define a mixed
> replication/erasure-coding setup without wanting to rip *our* hair out? The
> included Markdown document attempts to remedy this by proposing one out of
> many possible models and user interfaces. It includes examples for some of
> the most common use cases, including the "replica 2.5" case we've been
> discussing recently. Constructive feedback would be greatly appreciated.
>
>
> # Data Classification Interface
>
> The data classification feature is extremely flexible, to cover use cases
> from SSD/disk tiering to rack-aware placement to security or other
> policies. With this flexibility comes complexity. While this complexity
> does not affect the I/O path much, it does affect both the
> volume-configuration subsystem and the user interface to set placement
> policies. This document describes one possible model and user interface.
>
> The model we used is based on two kinds of information: brick descriptions
> and aggregation rules. Both are contained in a configuration file (format
> TBD) which can be associated with a volume using a volume option.
>
> ## Brick Descriptions
>
> A brick is described by a series of simple key/value pairs. Predefined
> keys include:
>
> * **media-type**
>   The underlying media type for the brick. In its simplest form this might
>   just be *ssd* or *disk*. More sophisticated users might use something
>   like *15krpm* to represent a faster disk, or *perc-raid5* to represent a
>   brick backed by a RAID controller.

Am I right in understanding that the value for media-type is not interpreted
beyond the scope of matching rules? That is to say, we don't need/have any
notion of media-types that are type-checked internally when forming
(sub)volumes using the rules specified.

> * **rack** (and/or **row**)
>   The physical location of the brick. Some policy rules might be set up to
>   spread data across more than one rack.
>
> User-defined keys are also allowed. For example, some users might use a
> *tenant* or *security-level* tag as the basis for their placement policy.
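Also, just to confirm my reading of the brick-description syntax: a brick
carrying both predefined and user-defined keys would look roughly like the
sketch below, I assume (the host, path, and values here are mine, purely for
illustration), and a rule could then select on *tenant* the same way the
examples further down select on *media-type*?

    brick host4:/export/brick1
        media-type = 15krpm
        rack = rack-3
        # user-defined key, not interpreted beyond rule matching
        tenant = engineering
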
> ## Aggregation Rules
>
> Aggregation rules are used to define how bricks should be combined into
> subvolumes, and those potentially combined into higher-level subvolumes,
> and so on until all of the bricks are accounted for. Each aggregation rule
> consists of the following parts:
>
> * **id**
>   The base name of the subvolumes the rule will create. If a rule is
>   applied multiple times this will yield *id-0*, *id-1*, and so on.
>
> * **selector**
>   A "filter" for which bricks or lower-level subvolumes the rule will
>   aggregate. This is an expression similar to a *WHERE* clause in SQL,
>   using brick/subvolume names and properties in lieu of columns. These
>   values are then matched against literal values or regular expressions,
>   using the usual set of boolean operators to arrive at a *yes* or *no*
>   answer to the question of whether this brick/subvolume is affected by
>   this rule.
>
> * **group-size** (optional)
>   The number of original bricks/subvolumes to be combined into each
>   produced subvolume. The special default value zero means to collect all
>   original bricks or subvolumes into one final subvolume. In this case,
>   *id* is used directly instead of having a numeric suffix appended.

Should the number of bricks or lower-level subvolumes that match the rule be
an exact multiple of group-size?

> * **type** (optional)
>   The type of the generated translator definition(s). Examples might
>   include "AFR" to do replication, "EC" to do erasure coding, and so on.
>   The more general data classification task includes the definition of new
>   translators to do tiering and other kinds of filtering, but those are
>   beyond the scope of this document. If no type is specified, cluster/dht
>   will be used to do random placement among its constituents.
>
> * **tag** and **option** (optional, repeatable)
>   Additional tags and/or options to be applied to each newly created
>   subvolume. See the "replica 2.5" example to see how this can be used.
>
> Since each type might have unique requirements, such as ensuring that
> replication is done across machines or racks whenever possible, it is
> assumed that there will be corresponding type-specific scripts or functions
> to do the actual aggregation. This might even be made pluggable some day
> (TBD). Once all rule-based aggregation has been done, volume options are
> applied similarly to how they are now.
>
> Astute readers might have noticed that it's possible for a brick to be
> aggregated more than once. This is intentional. If a brick is part of
> multiple aggregates, it will be automatically split into multiple bricks
> internally, but this will be invisible to the user.
>
> ## Examples
>
> Let's start with a simple tiering example. Here's what the
> data-classification config file might look like.
>
>     brick host1:/brick
>         media-type = ssd
>
>     brick host2:/brick
>         media-type = disk
>
>     brick host3:/brick
>         media-type = disk
>
>     rule tier-1
>         select media-type = ssd
>
>     rule tier-2
>         select media-type = disk
>
>     rule all
>         select tier-1
>         # use repeated "select" to establish order
>         select tier-2
>         type features/tiering
>
> This would create a DHT subvolume named *tier-2* for the bricks on *host2*
> and *host3*. Then it would add a features/tiering translator to treat
> *tier-1* as its upper tier and *tier-2* as its lower. Here's a more complex
> example that adds replication and erasure coding to the mix.
>
>     # Assume 20 hosts, four fast and sixteen slow (named appropriately).
>
>     rule tier-1
>         select *fast*
>         group-size 2
>         type cluster/afr
>
>     rule tier-2
>         # special pattern matching otherwise-unused bricks
>         select %{unclaimed}
>         group-size 8
>         type cluster/ec parity=2
>         # i.e. two groups, each six data plus two parity
>
>     rule all
>         select tier-1
>         select tier-2
>         type features/tiering

In the above example we would have two subvolumes, each containing two
bricks, aggregated by rule tier-1. Let's call those subvolumes tier-1-fast-0
and tier-1-fast-1. Both of these subvolumes are AFR-based two-way replicated
subvolumes. Are these instances of tier-1-* composed using cluster/dht by the
default semantics?
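To make the question concrete, this is the sort of graph I am picturing for
the fast bricks; it is only a volfile-style sketch of my understanding, and
the host/subvolume names in it are mine, purely for illustration:

    # two pairs produced by "rule tier-1" (group-size 2 over four fast bricks)
    volume tier-1-fast-0
        type cluster/afr
        subvolumes fast1-brick fast2-brick
    end-volume

    volume tier-1-fast-1
        type cluster/afr
        subvolumes fast3-brick fast4-brick
    end-volume

    # ...with an implicit cluster/dht on top, which is what "select tier-1"
    # in "rule all" would then resolve to?
    volume tier-1
        type cluster/dht
        subvolumes tier-1-fast-0 tier-1-fast-1
    end-volume
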
> Lastly, here's an example of "replica 2.5" to do three-way replication for
> some files but two-way replication for the rest.
>
>     rule two-way-parts
>         select *
>         group-size 2
>         type cluster/afr
>
>     rule two-way-pool
>         select two-way-parts*
>         tag special=no
>
>     rule three-way-parts
>         # use overlapping selections to demonstrate splitting
>         select *
>         group-size 3
>         type cluster/afr
>
>     rule three-way-pool
>         select three-way-parts*
>         tag special=yes
>
>     rule sanlock
>         select two-way*
>         select three-way*
>         type features/filter
>         # files named *.lock go in the replica-3 pool
>         option filter-condition-1 name:*.lock
>         option filter-target-1 three-way-pool
>         # everything else goes in the replica-2 pool
>         option default-subvol two-way-pool

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel