Re: [Linux-cluster] Interfacing csnap to cluster stack

Lon Hohberger <lhh@xxxxxxxxxx> · Thu, 07 Oct 2004 12:08:10 -0400

On Thu, 2004-10-07 at 02:07, David Teigland wrote:

> - Most of the time, if you think you should use SM, you're probably wrong
> and don't really know what it's for.
> 
> - SM is nothing like a Resource Manager; they are completely different
> things and do not interact with each other.  If you want them to
> interact you are probably still very confused.

- rgmanager is a replacement for what used to be called a "Service
Manager", also known as an "Application Manager".  In the "Resource
Group" model, the resource group is the atomic unit of failover.

> - A Resource Manager does not need to use the SM.  A RM is fundamentally
> about starting and monitoring system services or applications.  These
> services do /not/ include the symmetric "services" related to SM (fence
> domain manager, dlm and gfs).  If you think it might make sense for RM to
> manage, say gfs, you are still seriously confused.

- Agreed.  Magma (and thus rgmanager) uses the service manager, but only
as a way to determine which other cluster nodes are also running
rgmanager, and are thus in the service group.  When a node leaves the
SG, we no longer care about it, and we move its resource groups
elsewhere.
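
As a rough illustration of that last point, here is a minimal Python
sketch of "a member left the service group, so move its resource groups
to the survivors".  Everything in it is hypothetical: the function name,
the dict-based placement map and the least-loaded placement policy are
assumptions made for the example, not rgmanager's or Magma's actual
code or API.

```python
def relocate_on_leave(placement, members, departed):
    """Return a new placement with every group moved off `departed`.

    placement: dict mapping resource group name -> node name (hypothetical)
    members:   iterable of node names currently in the service group
    departed:  the node that just left the service group
    """
    survivors = sorted(m for m in members if m != departed)
    if not survivors:
        raise RuntimeError("no surviving members to host resource groups")

    # Count how many groups each survivor already hosts.
    load = {node: 0 for node in survivors}
    for node in placement.values():
        if node in load:
            load[node] += 1

    new_placement = dict(placement)
    for group in sorted(placement):
        if placement[group] == departed:
            # Assumed policy: least-loaded survivor, ties broken by name.
            target = min(survivors, key=lambda n: (load[n], n))
            new_placement[group] = target
            load[target] += 1
    return new_placement
```

Note that the resource group moves as a whole; nothing inside it is
split across nodes.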

> - Asymmetric services/applications can often make use of a Resource
> Manager.  Client-server systems have this fundamental HA problem because
> the server is by definition a single point of failure (something absent
> from symmetric systems.)  RM comes into the picture to address this
> problem by monitoring the server from above and restarting the server
> (possibly elsewhere) if it fails.  A prime example is NFS.  RM is able to
> monitor an NFS server and start it on another machine if it fails.  NFS is
> probably the model you should follow if your system is asymmetric and you
> want to use RM.  Perhaps a study of how that works is in order.

- This was the model that I pointed out.  Evidently, although 'cute', it
just will not work for csnap server failover.  Works well for other
things, though.

> - To me there are two obvious, well defined and understood methods to
> "clusterize" the csnap system.  One uses RM (and not SM) like NFS, the
> other uses SM (and not RM) with a more symmetric looking integrated
> client-server.  Using DLM locks is a third way you might solve this
> problem without using SM or RM; I don't understand the details of how that
> might work yet but it sounds interesting.

The lock-based approach works primarily because it assumes that the DLM
or GuLM can ensure that only one exclusive lock is granted at a time.
The holder of the lock thereby becomes the csnap master server, and the
old master will either have been fenced or have relinquished its duties
voluntarily (and thus is no longer a threat).

This satisfies the requirement that exactly one master server exists at
a time, while still allowing multiple servers to process other requests
(I think Daniel mentioned that he might like to have this ability).

Ironically, it's a similar model to having the csnap server running on
all nodes and having rgmanager activate/deactivate the master server,
but should be much faster.  Furthermore, it eliminates the need for
rgmanager (or any CRM, really) entirely.
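
To make the lock-based idea concrete, here is a small Python sketch of
"whoever holds the one exclusive lock is the master".  It is a toy
in-process model only: ToyLockManager, elect_master and the node names
are hypothetical stand-ins, and none of this is the DLM or GuLM API.  It
assumes, as above, that the lock manager grants at most one exclusive
lock at a time and that a fenced node loses its lock.

```python
class ToyLockManager:
    """In-process stand-in for a lock manager with one exclusive lock."""

    def __init__(self):
        self.holder = None      # node currently holding the lock, if any
        self.fenced = set()     # nodes that have been fenced

    def try_acquire(self, node):
        """Grant the lock only if it is free (or already held by node)."""
        if node in self.fenced:
            return False
        if self.holder is None:
            self.holder = node
            return True
        return self.holder == node

    def release(self, node):
        """Voluntary hand-off: the master gives up its duties."""
        if self.holder == node:
            self.holder = None

    def fence(self, node):
        """A fenced node can no longer hold or regain the lock."""
        self.fenced.add(node)
        if self.holder == node:
            self.holder = None


def elect_master(lock_manager, candidates):
    """Return the first candidate that obtains the lock, or None."""
    for node in candidates:
        if lock_manager.try_acquire(node):
            return node
    return None
```

Every node can run a server and call elect_master; only the lock holder
acts as master, and a new master can appear only after the old one has
been fenced or has released the lock.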

-- 
Lon Hohberger <lhh@xxxxxxxxxx>
Red Hat, Inc.