I've been thinking about and experimenting with some of the things we need in this area to support 4.0 features, especially data classification:

http://www.gluster.org/community/documentation/index.php/Features/data-classification

Before I suggest anything, a little background on how brick and volume management *currently* works.

(1) Users give us bricks, specified as host:path pairs.

(2) We assign each brick a unique ID and create a "hidden" directory structure under .glusterfs to support our needs.

(3) When bricks are combined into a volume, we create a bunch of volfiles.

(4) There is one volfile per brick, consisting of a linear "stack" of translators from storage/posix (which interacts with the local file system) up to protocol/server (which listens for connections from clients).

(5) When the volume is started, we start one glusterfsd process for each brick volfile.

(6) There is also a more tree-like volfile for clients, constructed as follows:

    (6a) We start with a protocol/client translator for each brick.
    (6b) We combine bricks into N-way sets using AFR, EC, etc.
    (6c) We combine those sets using DHT.
    (6d) We push a bunch of (mostly performance-related) translators on top.

(7) When a volume is mounted, we fetch the client volfile and instantiate all of the translators described there, plus mount/fuse to handle the local file system interface. For GFAPI it's the same except for mount/fuse.

(8) There are also volfiles for NFS, self-heal daemons, quota daemons, snapshots, etc. I'm going to ignore those for now.

The code for all of this is in glusterd-volgen.c, but I don't recommend looking at it for long, because it's one of the ugliest hairballs I've ever seen. In fact, you'd be hard pressed to recognize the above sequence of steps in that code. Pieces that belong together are splattered all over; pieces that should remain separate are mashed together; pieces that should use common code use copied code instead.
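To make steps 6a through 6d concrete, here's a minimal sketch of client-graph construction. This is purely illustrative pseudocode in Python, not the actual glusterd-volgen.c logic; the `Xlator` class and `build_client_graph` function are inventions for this sketch, and the translator type names are the ones the steps above refer to.

```python
class Xlator:
    """One node in a translator graph (illustrative, not a glusterd type)."""
    def __init__(self, type_, name, subvols=None, options=None):
        self.type = type_
        self.name = name
        self.subvols = subvols or []
        self.options = options or {}

def build_client_graph(volname, bricks, replica=2,
                       perf_xlators=("performance/write-behind",
                                     "performance/io-cache")):
    # 6a: one protocol/client translator per brick
    clients = [Xlator("protocol/client", f"{volname}-client-{i}",
                      options={"remote-host": host, "remote-subvolume": path})
               for i, (host, path) in enumerate(bricks)]
    # 6b: combine bricks into N-way sets (AFR shown; EC etc. analogous)
    subvols = [Xlator("cluster/replicate", f"{volname}-replicate-{i // replica}",
                      subvols=clients[i:i + replica])
               for i in range(0, len(clients), replica)]
    # 6c: combine those sets using DHT
    graph = Xlator("cluster/distribute", f"{volname}-dht", subvols=subvols)
    # 6d: push (mostly performance-related) translators on top
    for t in perf_xlators:
        graph = Xlator(t, f"{volname}-{t.split('/')[-1]}", subvols=[graph])
    return graph
```

The point of the sketch is that 6a-6c produce the part of the graph that is specific to a volume's layout, while 6d is a generic decoration pass; that split matters for the proposal below.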
As a prerequisite for adding new functionality, what's already there needs to be heavily refactored so that it makes some sense.

So . . . about that new functionality. The core idea of data classification is to apply step 6c repeatedly, with variants of DHT that do tiering or various other kinds of intelligent placement instead of the hash-based random placement we do now. "NUFA" and "switch" are already examples of this; in fact, their needs drove some of the code structure that makes data classification (DC) possible. The trickiest question with DC has always been how the user specifies these complex placement policies, which we then turn into volfiles. In the interest of maximizing compatibility with existing scripts and user habits, what I propose is that we do this by allowing the user to combine existing volumes into a new higher-level volume. This is similar to how the tiering prototype already works, except that "combining" volumes is more general than "attaching" a cache volume in that specific context. There are also some other changes we should make to do this right.

(A) Each volume has an explicit flag indicating whether it is a "primary" volume to be mounted etc. directly by users, or a "secondary" volume incorporated into another.

(B) Each volume has a graph representing steps 6a through 6c above (i.e. up to DHT). Only primary volumes also have a (second) graph representing steps 6d and 7.

(C) The graph/volfile for a primary volume might contain references to secondary volumes. These references are resolved at the same time that 6d and 7 are applied, yielding a complete graph without references.

(D) Secondary volumes may not be started and stopped by the user. Instead, a secondary volume is automatically started or stopped along with its primary.

(E) The user must specify an explicit option to see the status of secondary volumes.
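Points A through D can be sketched as follows. Again, this is a hypothetical model for discussion, not existing glusterd code: `VolRef`, `Node`, `Volume`, `resolve`, and `start` are all names invented here, and the `cluster/tier` usage mirrors the tiering prototype only loosely.

```python
class VolRef:
    """Placeholder node naming a secondary volume inside a primary's graph."""
    def __init__(self, volname):
        self.volname = volname
        self.subvols = []

class Node:
    """A translator node in a volume's 6a-6c graph (illustrative)."""
    def __init__(self, type_, subvols=None):
        self.type = type_
        self.subvols = subvols or []

class Volume:
    def __init__(self, name, graph, primary=True):
        self.name = name
        self.graph = graph        # steps 6a-6c only; 6d/7 applied separately
        self.primary = primary    # point A: explicit primary/secondary flag
        self.started = False

def secondaries_of(node, acc=None):
    # Walk an unresolved graph, collecting the secondaries it references.
    acc = set() if acc is None else acc
    if isinstance(node, VolRef):
        acc.add(node.volname)
    for s in node.subvols:
        secondaries_of(s, acc)
    return acc

def resolve(node, volumes):
    # Point C: splice each secondary's 6a-6c graph in place of its reference,
    # at the same time 6d/7 would be applied, yielding a reference-free graph.
    if isinstance(node, VolRef):
        return resolve(volumes[node.volname].graph, volumes)
    node.subvols = [resolve(s, volumes) for s in node.subvols]
    return node

def start(primary, volumes):
    # Point D: secondaries are started automatically with their primary.
    for name in secondaries_of(primary.graph):
        volumes[name].started = True
    primary.started = True
```

Applying step 6c repeatedly then just means a secondary volume's own graph can contain further `VolRef`s, so placement policies nest naturally.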
Without this option, secondary volumes are hidden, and status for their constituent bricks will be shown as though they were (directly) part of the corresponding primary volume.

As it turns out, most of the "extra" volfiles in step 8 above also have their own equivalents of steps 6d and 7, so implementing point C will probably make those paths simpler as well.

The one big remaining question is how this will work in terms of detecting and responding to volume configuration changes. Currently we treat each volfile as a completely independent entity and just compare whole graphs. Instead, what we need to do is track dependencies between graphs (a graph of graphs?) so that a change to a secondary volume will "ripple up" to its primary, where a new graph can be generated and compared to its predecessor.

Any other thoughts/suggestions?

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
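For the "graph of graphs" idea, one possible shape of the dependency tracking is sketched below; `DependencyTracker` and its methods are invented names, and whether a primary can itself be incorporated elsewhere (making the ripple transitive) is an open design question this sketch simply allows.

```python
from collections import defaultdict

class DependencyTracker:
    """Tracks which volumes incorporate which others (a graph of graphs)."""
    def __init__(self):
        # secondary volume name -> set of volumes whose graphs reference it
        self.deps = defaultdict(set)

    def record(self, volume, secondaries):
        # Called whenever a volume's graph is (re)generated.
        for s in secondaries:
            self.deps[s].add(volume)

    def dirty_volumes(self, changed_vol):
        """Volumes whose graphs must be regenerated and compared to their
        predecessors after changed_vol's configuration changes."""
        dirty, stack = set(), [changed_vol]
        while stack:
            v = stack.pop()
            for user in self.deps.get(v, ()):
                if user not in dirty:
                    dirty.add(user)
                    stack.append(user)  # ripple transitively if nesting allowed
        return dirty
```

So instead of comparing each volfile's whole graph in isolation, a change to a secondary would mark its users dirty, and only those graphs get regenerated and diffed.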