On Monday 11 October 2004 19:08, Daniel McNeil wrote:
> On Fri, 2004-10-08 at 17:49, Daniel Phillips wrote:
> >   - You will be faced with the task of coding every possible
> >     resource metric into some form of locking discipline.
> >
> >   - Your resource metrics are step functions, the number of steps
> >     being the number of locking layers you lather on.  Real
> >     resource metrics are more analog than that.
> >
> >   - You haven't done anything to address the inherent raciness of
> >     giving the lock to the first node to grab it.  Chances are good
> >     you'll always be giving it to the same node.
>
> I do not think of these as "problems".

They are problems if you claim you are doing resource management, as
opposed to just being random (and nonuniformly random at that).  That
said, at the moment we just want something up and running; niceties can
come later.  There's an algorithm down below, based on grabbing a lock.

> You never answered, How would a resource manager know to pick the
> "best" choice?

That depends on how it is told to pick: by pre-ordained configuration,
by automagic balancing algorithms, or by a combination of the two.

> The cluster is made up of software components (see pretty picture
> attached).  IMHO, it would be good to follow some simple rules:
>
> 1. Components higher on the stack should only depend on
>    components lower on the stack.  Let's avoid circular
>    dependencies.
>
> 2. When possible, use "standard" components and APIs.
>    We have agreed on some common components:
>
>    DLM
>    cluster membership and quorum

Motherhood.

>    cluster communications (sort of)

There's no standard component or API, and nobody has proved one is
needed.  (If somebody implements one and lots of things get smaller,
faster and more reliable, that is proof.)

> AFAICT, resource management is higher up the stack, and having shared
> storage like cluster snapshot depend on it would cause circular
> dependencies.
Not only that, but after a read-through, rgmanager is not suited to
low-level use as currently conceived.  Just one of many problems: we
don't want to be parsing XML in a block device failover path.  So I
will stop bothering Lon about making it be what it's not.

> SM is a Sistina/Red Hat specific thing.  Might be wonderful, but it
> is not common.  David's email leads me to believe it is not the right
> component to interface with.

It's too large a hammer with which to hit this flea.

> So, what is currently implemented that we have to work with?
> Membership and DLM.  These are core services and seem to be
> pretty solid right now.
>
> So how can we use these?  Seems fairly simple:
>
> 1st implementation:
> ===================
>
> Add a single DLM lock in the csnap server.
> When a snapshot target is started, start up a csnap server.
> If the csnap server gets the lock, he is master.
> In normal operation, the csnap server is up and running
> on all nodes.  One node has the DLM lock and the others
> are ready to go, but waiting for the DLM lock to convert.
> On failure, the next node to get the lock is master.

Something like that.  Actually, the job is done by the csnap agent,
which is also responsible for handing server connections to csnap
clients on the node.  (Whether the csnap server eventually becomes
part of the csnap agent is another question.)  The agent does:

Repeat as necessary:

  - get a Concurrent Read lock
  - if there's a server address in the lvb, we're done
  - otherwise, convert to Protected Write without waiting
  - if we got it, write our server address to the lvb; we're done

This is driven by one or more csnap clients on the node noticing a
broken server connection.

We also need to arrange for the csnap server to give up the PW lock if
its node leaves the cluster.  The agent had better subscribe to some
sort of cluster management event here, except there isn't any such
event except when gdlm is already dead, which isn't much use.  This is
a big fat deficiency.
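To make the agent's loop above concrete, here is a toy simulation of it.
A FakeDLM class stands in for gdlm: one Protected Write holder at a
time, Concurrent Read always granted, and a shared lock value block.
All names, addresses, and the tie-break behaviour are illustrative
assumptions; the real agent would of course talk to the DLM, not this.

```python
class FakeDLM:
    """Stand-in for gdlm: a single PW holder plus a shared lvb."""
    def __init__(self):
        self.pw_holder = None   # node currently holding Protected Write
        self.lvb = b""          # lock value block, readable by CR holders

    def read_lvb(self, node):
        # Concurrent Read is always granted; return current lvb contents.
        return self.lvb

    def try_convert_pw(self, node):
        # Convert CR -> PW without waiting: succeed only if free.
        if self.pw_holder in (None, node):
            self.pw_holder = node
            return True
        return False

def find_or_become_server(dlm, node, my_address):
    """Return the server address csnap clients on this node should use."""
    while True:
        # - get a Concurrent Read lock; look for a server in the lvb
        addr = dlm.read_lvb(node)
        if addr:
            return addr                  # a server already exists
        # - otherwise, convert to Protected Write without waiting
        if dlm.try_convert_pw(node):
            dlm.lvb = my_address         # - write our address; we're done
            return my_address
        # lost the race: repeat as necessary

dlm = FakeDLM()
a = find_or_become_server(dlm, "node1", b"10.0.0.1:8080")
b = find_or_become_server(dlm, "node2", b"10.0.0.2:8080")
print(a, b)   # both get node1's address: exactly one master

# Simulate the master dying: the PW lock is dropped, the lvb is
# invalidated, and a client's broken connection re-drives the loop.
dlm.pw_holder, dlm.lvb = None, b""
c = find_or_become_server(dlm, "node2", b"10.0.0.2:8080")
print(c)      # node2 takes over
```

The PW lock is what protects the lvb from two agents writing competing
server addresses; the loop's reread after a failed convert is what
makes the losing agents pick up the winner's address.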
> If the administrator knows which machine is "best", have him
> start the snapshot targets on that machine 1st.  Not perfect,
> but simple, and it provides high availability.
>
> It is also possible for the csnap server to put its
> server address and port information in the LVB.
>
> This seems simple, workable, and easy to program.

And it maps easily to either gdlm or gulm, though somebody would have
to write a userland interface to make this transparent.  (Lon?)  I
haven't thought of any way to be lazier.

The other main alternative I looked into is to use one multicast
message to request a server and another to announce a server, which is
the sort of thing gdlm does internally anyway, but it probably uses
fewer messages and doesn't futz with LVBs.  However, an exclusive lock
is still needed to protect the snapshot soup from too many cooks.  If
we look closely at what's involved in getting that exclusive lock, we
notice that the lock master will typically be the first node in the
membership list, so why not just assign the agent on that node to
anoint new servers and hand out server addresses?  Then we only have
to worry about membership races, an interesting topic in itself.  This
one goes on the back burner for when I have too much time on my hands.

> Questions:
>
> I do not understand what you mean by inherent raciness.
> Once a cluster is up and running, the first csnap server
> starts up.  It does not stop until it dies, which I assume
> is rare.  What raciness are you talking about?

Failover: with the "grab a lock" node selection method, whoever is
first to notice that the old server died will probably end up starting
the new one.  This doesn't qualify as resource management; it does,
however, keep the cluster alive.

> How complicated of a resource metric were you thinking about?

User defined, where one of the things the user can say is "automagic".
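For what it's worth, the back-burner alternative above (let the agent
on the first node in the membership list anoint servers) is just a
deterministic function of the membership view, which every node can
compute for itself.  A toy sketch, where node ids and the "lowest id
comes first" rule are assumptions of mine, not anything in csnap:

```python
def anointer(membership):
    """All nodes with the same membership view agree on the answer."""
    return min(membership)       # "first node in the membership list"

def i_am_anointer(my_node, membership):
    # Each agent asks: is it my job to anoint servers right now?
    return my_node == anointer(membership)

members = {3, 7, 12}
print(anointer(members))         # node 3 hands out server addresses
members.discard(3)               # node 3 leaves the cluster
print(anointer(members))         # node 7 takes over
```

The part this glosses over is exactly the membership race mentioned
above: during a membership transition, two nodes can briefly hold
different views and both believe they are the anointer.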
A simple priority scheme would let the user assign a priority number
to each node, and the resource manager picks the node with the highest
priority (there is no point in distributing this algorithm).  An
improved resource manager would collect load statistics to adjust the
priority numbers.  Any priority adjustment would be done outside the
failover path, so we would not need to worry about auditing that code
for bounded memory use.

For now we will put aside grand designs and go with a crude method, on
the theory that "snapshot server node choice is too random" will not
make the top ten list of things that suck most about our cluster any
time soon.

> I have read through the design doc and am still thinking about client
> reconnect.  Are you planning on implementing the 4 message
> snapshot read protocol?

Yes, it's easy to do and it guarantees fast failover.  It does,
however, double the network latency of a snapshot (versus origin) read
request, which is very visible on some loads.  This will end up as a
per-client option, I think.

> There must be some internal cluster communication mechanisms
> for membership (cman) and DLM to work.  Is there some reason why
> these are not suitable for snapshot client to server
> communication?

Csnap will happily use any SOCK_STREAM socket; its interface is just
read/write and shutdown.  Cman/gdlm's messaging scheme is considerably
fancier, for some reason that isn't yet clear to me.  For csnap, it
would just be a source of extra task switches, other overhead, and
more code.

Regards,

Daniel