At the cluster summit most people seemed to agree that we needed a generic, pluggable kernel API for cluster functions. Well, I've finally got round to doing something.

The attached spec allows for plug-in cluster modules, with the possibility of a node being a member of multiple clusters if the cluster managers allow it. I've separated out the functions of "cluster manager" so they can be provided by different components if necessary.

Two things that are not complete (or even started) in here are a communications API and a locking API. For the first, I'd like to leave that to those more qualified than me; for the second I'd like to (less modestly) propose our existing DLM API, with the argument that it is a full-featured API that others can implement parts of if necessary.

Comments please.

-- patrick

CONCEPTS
--------
The kernel holds a list of named cluster management modules which register themselves at insmod time. Each of these may provide one or more groups of services: "comms", "membership" and "quorum".

In theory a node may be a member of many clusters, though some cluster managers may prevent this.

The kernel APIs presented here are meant to be simple enough to be tidy, but featureful enough to implement SAF on top in userspace. I don't think it is appropriate to implement the full SAF specification in kernel space.

Membership ops
--------------
struct membership_node_address
{
	int32_t mna_len;
	char    mna_address[MAX_ADDR_LEN];
};

struct membership_node
{
	int32_t                        mn_nodeid;
	struct membership_node_address mn_address;
	char                           mn_name[MAX_NAME_LEN];
	uint32_t                       mn_member;
	struct timeval                 mn_boottime;
};

struct membership_notify_info
{
	void *   mni_context;
	uint32_t mni_viewnumber;
	uint32_t mni_numitems;
	uint32_t mni_nummembers;
	char *   mni_buffer;
};

/* This is what is called by membership services as a callback */
typedef int (membership_callback_routine) (void *context, uint32_t reason);

struct membership_ops
{
	int (*start_notify) (void *cmprivate, void *context, uint32_t flags,
			     membership_callback_routine *callback,
			     char *buffer, int max_items);
#define MEMBERSHIP_FLAGS_NOTIFY_CHANGES 1 /* Notify of membership changes */
#define MEMBERSHIP_FLAGS_NOTIFY_NODES   2 /* Send me a full node list now */

	int (*notify_stop) (void *cmprivate);
	int (*get_name) (void *cmprivate, char *name, int maxlen);
	int (*get_node) (void *cmprivate, int32_t nodeid,
			 struct membership_node *node);
#define MEMBERSHIP_NODE_THISNODE (-1) /* Get info about local node */
};

I've made node IDs a signed int32; this allows for a negative pseudo-ID for "this node". cman uses 0 for "this node", but other membership APIs may allow a real node to have an ID of zero. SAF uses a "this node" pseudo-ID.

Quorum ops
----------
/* These might be a bit too specific...
 */
struct quorum_info
{
	uint32_t qi_total_votes;
	uint32_t qi_expected_votes;
	uint32_t qi_quorum;
};

struct quorum_ops
{
	int (*get_quorate) (void *cmprivate);
	int (*get_votes) (void *cmprivate, int32_t nodeid);
	int (*get_info) (void *cmprivate, struct quorum_info *info);
};

Bottom interface.
-----------------
/* When a CM module is loaded it calls cm_register()
 * which adds its proto_name/ops pair to a global list.
 */
int cm_register(struct cm_ops *proto);
void cm_unregister(struct cm_ops *proto);

/* A CM sets up one of these structs with the functions it can provide and
 * registers it, along with its name (type) using cm_register()
 */
struct cm_ops
{
	char co_proto_name[256];

	/* These are required */
	int (*co_attach) (struct cm_info *info);
	int (*co_detach) (void *cmprivate);

	/* These are optional, a CM may provide some or all */
	struct cm_comm_ops   *co_cops;
	struct cm_member_ops *co_mops;
	struct cm_quorum_ops *co_qops;
};

Others
------
I've omitted the comms interface because I'm not really sure how featured this really ought to be. We may want to add a locking interface in here too?

Top interface
-------------
/* When cm_attach() is called, the "harness" searches the
 * global list of registered CMs, looking for one with the given
 * proto_name. If one is found, its co_attach() function is called, being
 * passed the cm_attach() parameters.
 */
int cm_attach(char *proto_name, char *cluster_name, struct cm_info *info);
void cm_detach(void *cmprivate);

/* When a CM's attach function is called, it fills in the cm_info struct
 * provided by the caller with its own ops functions and values. This
 * includes its private data pointer to be used with its ops functions.
 */
struct cm_info
{
	struct cm_ops *ops;
	void *cmprivate;
};

eg
--
Say "foo" is a "low level" system and provides select comms and member functions.

1. it sets foo_ops
	co_proto_name = "foo";
	co_attach = foo_attach;
	co_detach = foo_detach;
	co_cops = foo_cops;
	co_mops = foo_mops;
	co_qops = NULL;

2. and calls cm_register(&foo_ops);

Say "bar" is a higher level system and provides select member and quorum functions.

1. it sets bar_ops
	co_proto_name = "bar";
	co_attach = bar_attach;
	co_detach = bar_detach;
	co_cops = NULL;
	co_mops = bar_mops;
	co_qops = bar_qops;

2. and calls cm_register(&bar_ops);

Internally, bar could attach to foo and use the functions foo provides. Bar may provide some member_ops functions that foo doesn't, in addition to some quorum services, none of which foo provides.

Applications may attach to just bar, just foo, or in some cases both foo and bar.

bar could be programmed to use foo statically (like lock_dlm is programmed to use dlm and cman, but gfs can use either lock_dlm or lock_gulm). bar could also take the lower level type (foo) as an input parameter in some way, making it dynamic.
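
To make the register/attach flow a bit more concrete, here is a rough userspace sketch of what the harness might do. It is only an illustration under several assumptions that are not part of the spec: the fixed-size table, CM_MAX_PROTOS, the cm_find() helper, the negative-errno return values and the foo_private data are all made up for the example, and struct cm_ops is trimmed to the required members. A real kernel version would more likely use a locked linked list. cm_detach() is left out, since the spec does not yet say how the harness maps a cmprivate pointer back to its CM.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct cm_info;

/* Trimmed copy of the spec's cm_ops: the optional co_cops, co_mops and
 * co_qops pointers are left out to keep the sketch short. */
struct cm_ops {
	char co_proto_name[256];
	int (*co_attach)(struct cm_info *info);
	int (*co_detach)(void *cmprivate);
};

struct cm_info {
	struct cm_ops *ops;
	void *cmprivate;
};

/* Hypothetical registry: a small fixed table stands in for the
 * "global list" of registered CMs. */
#define CM_MAX_PROTOS 8
static struct cm_ops *cm_protos[CM_MAX_PROTOS];

/* Hypothetical helper: index of the CM registered under name, or -1. */
static int cm_find(const char *name)
{
	int i;

	for (i = 0; i < CM_MAX_PROTOS; i++)
		if (cm_protos[i] &&
		    strcmp(cm_protos[i]->co_proto_name, name) == 0)
			return i;
	return -1;
}

int cm_register(struct cm_ops *proto)
{
	int i;

	/* co_attach and co_detach are the required ops */
	assert(proto && proto->co_attach && proto->co_detach);

	if (cm_find(proto->co_proto_name) >= 0)
		return -EEXIST;		/* name already taken */
	for (i = 0; i < CM_MAX_PROTOS; i++) {
		if (!cm_protos[i]) {
			cm_protos[i] = proto;
			return 0;
		}
	}
	return -ENOSPC;			/* table full */
}

void cm_unregister(struct cm_ops *proto)
{
	int i;

	for (i = 0; i < CM_MAX_PROTOS; i++)
		if (cm_protos[i] == proto)
			cm_protos[i] = NULL;
}

int cm_attach(char *proto_name, char *cluster_name, struct cm_info *info)
{
	int i = cm_find(proto_name);

	/* co_attach() as specified takes only the cm_info, so the cluster
	 * name cannot be handed on in this sketch. */
	(void)cluster_name;
	if (i < 0)
		return -ENOENT;		/* no CM of that name */
	return cm_protos[i]->co_attach(info);
}

/* Hypothetical "foo" CM providing only the required ops. */
static int foo_private = 1;
static int foo_attach(struct cm_info *info);
static int foo_detach(void *cmprivate);

struct cm_ops foo_ops = {
	.co_proto_name = "foo",
	.co_attach = foo_attach,
	.co_detach = foo_detach,
};

/* The CM fills in the caller's cm_info, as described above. */
static int foo_attach(struct cm_info *info)
{
	info->ops = &foo_ops;
	info->cmprivate = &foo_private;
	return 0;
}

static int foo_detach(void *cmprivate)
{
	(void)cmprivate;
	return 0;
}
```

A caller would do cm_register(&foo_ops) at module load, then cm_attach("foo", "somecluster", &info), and afterwards go through info.ops with info.cmprivate. Note that the cluster name gets dropped on the way to co_attach() here, which suggests the co_attach() signature may want a cluster_name argument if the CM is supposed to see it.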