On Monday 11 October 2004 19:08, Daniel McNeil wrote:
> On Fri, 2004-10-08 at 17:49, Daniel Phillips wrote:
> >   - You will be faced with the task of coding every possible
> >     resource metric into some form of locking discipline.
> >
> >   - Your resource metrics are step functions, the number of steps
> >     being the number of locking layers you lather on.  Real
> >     resource metrics are more analog than that.
> >
> >   - You haven't done anything to address the inherent raciness of
> >     giving the lock to the first node to grab it.  Chances are good
> >     you'll always be giving it to the same node.
>
> I do not think of these as "problems".

They are problems if you claim you are doing resource management, as
opposed to just being random (and nonuniformly random at that).  That
said, at the moment we just want something up and running; niceties can
come later.  There's an algorithm down below, based on grabbing a lock.

> You never answered, How would a resource manager know to pick the
> "best" choice?

That depends on how it is told to pick: by pre-ordained configuration,
by automagic balancing algorithms, or by a combination of the two.

> The cluster is made up of software components (see pretty picture
> attached).  IMHO, it would be good to follow some simple rules:
>
> 1. Components higher on the stack should only depend on
>    components lower on the stack.  Let's avoid circular
>    dependencies.
>
> 2. When possible, use "standard" components and APIs.
>    We have agreed on some common components:
>
>    DLM
>    cluster membership and quorum

Motherhood.

>    cluster communications (sort of)

There's no standard component or API, and nobody has proved one is
needed.  (If somebody implements one and lots of things get smaller,
faster and more reliable, that is proof.)

> AFAICT, resource management is higher up the stack, and having shared
> storage like cluster snapshot depend on it would cause circular
> dependencies.
Not only that, but after a read-through, rgmanager is not suited to
low-level use as currently conceived.  Just one of many problems: we
don't want to be parsing XML in a block device failover path.  So I
will stop bothering Lon about making it be what it's not.

> SM is a Sistina/Red Hat specific thing.  Might be wonderful, but it
> is not common.  David's email leads me to believe it is not the right
> component to interface with.

It's too large a hammer with which to hit this flea.

> So, what is currently implemented that we have to work with?
> Membership and DLM.  These are core services and seem to be
> pretty solid right now.
>
> So how can we use these?  Seems fairly simple:
>
> 1st implementation:
> ===================
>
> Add a single DLM lock in the csnap server.
> When a snapshot target is started, start up a csnap server.
> If the csnap server gets the lock, he is master.
> In normal operation, the csnap server is up and running
> on all nodes.  One node has the DLM lock and the others
> are ready to go, but waiting for the DLM lock to convert.
> On failure, the next node to get the lock is master.

Something like that.  Actually, the job is done by the csnap agent,
which is also responsible for handing server connections to csnap
clients on the node.  (Whether the csnap server eventually becomes
part of the csnap agent is another question.)  The agent does:

Repeat as necessary:

  - get a Concurrent Read lock
  - if there's a server address in the lvb, we're done
  - otherwise, convert to Protected Write without waiting
  - if we got it, write our server address to the lvb; we're done

This is driven by one or more csnap clients on the node noticing a
broken server connection.

We also need to arrange for the csnap server to give up the PW lock if
its node leaves the cluster.  The agent had better subscribe to some
sort of cluster management event here, except there isn't any such
event except when gdlm is already dead, which isn't much use.  This is
a big fat deficiency.
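To make the agent's loop above concrete, here is a toy simulation of it.
A FakeDLM class stands in for gdlm: one Protected Write holder at a
time, Concurrent Read always granted, and a shared lock value block.
All names, addresses, and the tie-break behaviour are illustrative
assumptions; the real agent would of course talk to the DLM, not this.

```python
class FakeDLM:
    """Stand-in for gdlm: a single PW holder plus a shared lvb."""
    def __init__(self):
        self.pw_holder = None   # node currently holding Protected Write
        self.lvb = b""          # lock value block, readable by CR holders

    def read_lvb(self, node):
        # Concurrent Read is always granted; return current lvb contents.
        return self.lvb

    def try_convert_pw(self, node):
        # Convert CR -> PW without waiting: succeed only if free.
        if self.pw_holder in (None, node):
            self.pw_holder = node
            return True
        return False

def find_or_become_server(dlm, node, my_address):
    """Return the server address csnap clients on this node should use."""
    while True:
        # - get a Concurrent Read lock; look for a server in the lvb
        addr = dlm.read_lvb(node)
        if addr:
            return addr                  # a server already exists
        # - otherwise, convert to Protected Write without waiting
        if dlm.try_convert_pw(node):
            dlm.lvb = my_address         # - write our address; we're done
            return my_address
        # lost the race: repeat as necessary

dlm = FakeDLM()
a = find_or_become_server(dlm, "node1", b"10.0.0.1:8080")
b = find_or_become_server(dlm, "node2", b"10.0.0.2:8080")
print(a, b)   # both get node1's address: exactly one master

# Simulate the master dying: the PW lock is dropped, the lvb is
# invalidated, and a client's broken connection re-drives the loop.
dlm.pw_holder, dlm.lvb = None, b""
c = find_or_become_server(dlm, "node2", b"10.0.0.2:8080")
print(c)      # node2 takes over
```

The PW lock is what protects the lvb from two agents writing competing
server addresses; the loop's reread after a failed convert is what
makes the losing agents pick up the winner's address.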
> If the administrator knows which machine is "best", have him
> start the snapshot targets on that machine 1st.  Not perfect,
> but simple, and it provides high availability.
>
> It is also possible for the csnap server to put its
> server address and port information in the LVB.
>
> This seems simple, workable, and easy to program.

And it maps easily to either gdlm or gulm, though somebody would have
to write a userland interface to make this transparent.  (Lon?)  I
haven't thought of any way to be lazier.

The other main alternative I looked into is to use one multicast
message to request a server and another to announce a server, which is
the sort of thing gdlm does internally anyway, but it probably uses
fewer messages and doesn't futz with LVBs.  However, an exclusive lock
is still needed to protect the snapshot soup from too many cooks.  If
we look closely at what's involved in getting that exclusive lock, we
notice that the lock master will typically be the first node in the
membership list, so why not just assign the agent on that node to
anoint new servers and hand out server addresses?  Then we only have
to worry about membership races, an interesting topic in itself.  This
one goes on the back burner for when I have too much time on my hands.

> Questions:
>
> I do not understand what you mean by inherent raciness.
> Once a cluster is up and running, the first csnap server
> starts up.  It does not stop until it dies, which I assume
> is rare.  What raciness are you talking about?

Failover: with the "grab a lock" node selection method, whoever is
first to notice that the old server died will probably end up starting
the new one.  This doesn't qualify as resource management; it does,
however, keep the cluster alive.

> How complicated of a resource metric were you thinking about?

User defined, where one of the things the user can say is "automagic".
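For what it's worth, the back-burner alternative above (let the agent
on the first node in the membership list anoint servers) is just a
deterministic function of the membership view, which every node can
compute for itself.  A toy sketch, where node ids and the "lowest id
comes first" rule are assumptions of mine, not anything in csnap:

```python
def anointer(membership):
    """All nodes with the same membership view agree on the answer."""
    return min(membership)       # "first node in the membership list"

def i_am_anointer(my_node, membership):
    # Each agent asks: is it my job to anoint servers right now?
    return my_node == anointer(membership)

members = {3, 7, 12}
print(anointer(members))         # node 3 hands out server addresses
members.discard(3)               # node 3 leaves the cluster
print(anointer(members))         # node 7 takes over
```

The part this glosses over is exactly the membership race mentioned
above: during a membership transition, two nodes can briefly hold
different views and both believe they are the anointer.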
A simple priority scheme would let the user assign a priority number
to each node, and the resource manager picks the node with the highest
priority (there is no point in distributing this algorithm).  An
improved resource manager would collect load statistics to adjust the
priority numbers.  Any priority adjustment would be done outside the
failover path, so we would not need to worry about auditing that code
for bounded memory use.

For now we will put aside grand designs and go with a crude method, on
the theory that "snapshot server node choice is too random" will not
make the top ten list of things that suck most about our cluster any
time soon.

> I have read through the design doc and am still thinking about client
> reconnect.  Are you planning on implementing the 4 message
> snapshot read protocol?

Yes, it's easy to do and it guarantees fast failover.  It does,
however, double the network latency of a snapshot (versus origin) read
request, which is very visible on some loads.  This will end up as a
per-client option, I think.

> There must be some internal cluster communication mechanisms
> for membership (cman) and DLM to work.  Is there some reason why
> these are not suitable for snapshot client to server
> communication?

Csnap will happily use any SOCK_STREAM socket; its interface is just
read/write and shutdown.  Cman/gdlm's messaging scheme is considerably
fancier, for some reason that isn't yet clear to me.  For csnap, it
would just be a source of extra task switches, other overhead, and
more code.

Regards,

Daniel