On Tue, 2004-10-05 at 11:03, Daniel Phillips wrote:
> Hi all,
>
> The time has arrived to connect the cluster snapshot block device to the
> cluster infrastructure, so that failover will work as God intended. Ben
> and I have been pondering just how to go about this, using the various
> bits and pieces available, and perhaps evolving some more bits as we
> go.
>
> The cluster snapshot devices interfaces to a user space daemon called
> csnap-agent, whose main job is to receive server connection requests
> and deliver server connections to the driver. The plan was, we will
> customize the csnap agent as necessary to interface to the cluster
> infrastructure. So, how, exactly?
>
> The idea is, there is a service manager out there somewhere that keeps
> track of how many instances of a service of a given type currently
> exist, and has some way of creating new resource instances if needed,
> or killing off extra ones. In our case, we want exactly one instance
> of a csnap server. We need not only to specify that constraint somehow
> and get a connection to it, but we need to supply a method of starting
> a csnap server. So csnap-agent will be a client of service manager and
> an agent of resource manager.

Why do you need a service manager for this? As Lon suggested, a DLM
lock can provide the one master, with the others ready to take over
when the lock is released.

> We won't talk to either service manager or resource manager directly,
> but go through Lon's Magma library, which is supposed to provide a nice
> stable api for us to work with, regardless of whether particular
> services reside in kernel or user space, or are local or remote. Lon
> has said that he will adapt the Magma api as we go, if we break
> anything or run into limitations. (I suppose that is why it is called
> Magma, it flows under pressure.)
>

Why do we want to use Magma? At the cluster summit I thought that
Magma was just the way to provide backward compatibility for the older
GFS releases.
Did we agree to make Magma the API? Having csnap depend on the DLM API
makes more sense to me.

> Magma receives requests by direct library calls and supplies answers
> either via function returns or via events delivered over a socket
> connection, which seems to be a pretty good fit with the way csnap does
> things. So now, what are we going to ask it, and how is it going to
> answer?
>
> 1. Request a snapshot server host:port name, creating an instance
>    if necessary
>
> 2. Register to act as an agent to start a snapshot server instance
>
> My instinct is that we do not want 1. to be a blocking call into Magma,
> that returns only when it has a server instance, because we may want
> our agent to be able to service other events while it waits for its
> server address. So the likely interface is to call magma, saying what
> kind of server we want, and wait for the address to arrive as an event.
>
> Magma doesn't actually know anything about what we're asking it, it only
> knows how to pass on requests to somebody who does. So we're actually
> talking to service manager and resource manager through Magma, and
> presumably they talk to each other as well, because service manager
> must ask resource manager to create or kill off resource instances on
> its behalf.

What would need to be killed off? Under what circumstances?

>
> Anyway, csnap-agent is mainly going to be talking to service manager
> through Magma, but it also needs to tell resource manager about our
> resource, its constraints and how to set itself up as an agent to
> create it. I don't have a clear picture of how this works at the
> moment, and that is the point of this email.
>
> For example, how do we specify the service manager constraints, i.e.,
> "exactly one" in this case: before we request the instance, or as part
> of the request, or in a configuration file somewhere?
>

The csnap-agent to csnap-server connection seems like a perfect example
of why we need a cluster communication API.
The csnap-agent wants to send information to the csnap-server and could
use a highly available communication mechanism.

Daniel
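
To make the lock-based suggestion above a bit more concrete, here is a rough sketch of "one master, others ready to take over". It is only an illustration: `ClusterLock`, `try_acquire`, `release` and `elect` are hypothetical names, and the lock is an in-process stand-in built on Python's `threading.Lock`, not the DLM API. With a real cluster lock, the release would presumably happen when the master exits or its node fails.

```python
import threading


class ClusterLock:
    """Hypothetical stand-in for a cluster-wide exclusive lock.

    This is NOT the DLM API; it only models "one holder at a time".
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def try_acquire(self, node):
        # Non-blocking attempt, so the caller stays free to service
        # other events if it does not get the lock.
        if self._lock.acquire(blocking=False):
            self.holder = node
            return True
        return False

    def release(self, node):
        # Only the current holder may give the lock up.
        if self.holder != node:
            raise RuntimeError(f"{node} does not hold the lock")
        self.holder = None
        self._lock.release()


def elect(lock, nodes):
    """Let every node try once; return whoever holds the lock afterwards."""
    for node in nodes:
        lock.try_acquire(node)
    return lock.holder


lock = ClusterLock()
nodes = ["node-a", "node-b", "node-c"]

master = elect(lock, nodes)         # the first node to try becomes master
lock.release(master)                # master goes away; the lock is dropped
standbys = [n for n in nodes if n != master]
new_master = elect(lock, standbys)  # one of the standbys takes over
```

The point of the sketch is that the "exactly one" constraint falls out of the lock itself: whoever holds it runs the csnap server, and nobody needs to count instances or kill off extras.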