Re: Glusterd 'Management Volume' proposal

Top-posting, as these are mostly queries rather than comments on the MV proposal described below.

1) With the current scheme in glusterd, the O(N^2) cost is because the configuration is replicated to every peer in the cluster, correct?

- In the new approach (either MV or otherwise), the idea is to maintain a configuration cluster, i.e. a set of nodes that hold the configuration-related information, correct?

- The rest of the peers get the latest configuration as it changes (this is the watch functionality that Jeff brings out). This part of the requirement is not covered in the proposal; it would help if it were elaborated as well.

- We do have the limitation today that some clients _may_ not have received the latest graph (one of the configuration items here). With the new proposal, is there any thought on resolving this? Is it required? I assume brick nodes have this strict enforcement today, as they would in the future.

2) With a >1000-node setup, is it intended that we have cascade functionality to handle configuration changes? I.e., there is a defined set of _watchers_ on the configuration cluster, and each in turn serves a set of peers for their _watch_ functionality?

This may be overkill (i.e., requiring cascading), but is it needed when we consider cases like geo-replication or tiers in different data centers that need configuration updates? Having all of them watch the configuration cluster may be a problem requiring attention.

Onto MV proposal,
- Using a smaller, pure-replicate gluster volume, sans a few xlators, with locking enforced by its consumers, seems like a good way to solve the replication, consistency, and hence availability of the configuration information.

- And as you mention, a POSIX-y interface with an application on top of it seems heavyweight for the key-value store that a configuration volume effectively presents.

- We still need watcher functionality and possibly cascading support.

I am _not_ well versed in the internals of etcd (or the other frameworks being discussed), so I cannot compare what we could leverage from them, what functionality they lack, or the production-worthiness of the code.

Going by your initial statements, the concern seems to be the dependency on another component, in terms of releases and required bug fixes. I would go on to state that if the infrastructure is production-ready, then managing the dependency would be relatively easy. The real challenge is how much effort needs to be spent understanding its internals, and whether we need to do so, in order to support this in gluster deployments. Any clues or ideas on this, to help make a decision?

Shyam

On 11/19/2014 02:22 AM, Krishnan Parthasarathi wrote:
All,

We have been thinking of many approaches to address some of Glusterd's correctness
(during failures and at scale) and scalability concerns. A recent email thread on
Glusterd-2.0 was along these lines. While that discussion is still valid, we have been
considering dogfooding as a viable option to solve our problems. This is not the first
time this has been mentioned, but for various reasons it didn't really take off. The following
proposal addresses Glusterd's requirement for a distributed (consistent) store by using a GlusterFS
volume. Then who manages that GlusterFS volume? To find answers to that and more,
read further.

[The following content is also available here: https://gist.github.com/krisis/945e45e768ef1c4e446d
Please keep the discussions on the mailing list and _not_ in github, for traceability
reasons.]


##Abstract

Glusterd, the management daemon for GlusterFS, maintains the volume and cluster
configuration store using a home-grown replication algorithm. Some of its shortcomings
are as follows.

- Involves O(N^2) (in the number of nodes) network messages to replicate
   configuration changes for every command

- Doesn't rely on quorum and is not resilient to network partitions

- Recovery of nodes that come back online can choke the network at scale

The thousand-node glusterd proposal[1], one of the more mature proposals
addressing the above problems, recommends the use of a consistent distributed
store like consul/etcd for maintaining the volume and cluster configuration.
While the technical merits of this approach make it compelling, the operational
challenges, like coordinating between the two communities for releases and
bug-fixes, could get out of hand. An alternate approach[2] is to use a
replicated GlusterFS volume as the distributed store instead. The remainder of
this email explains how a GlusterFS volume could be used to store configuration
information.


##Technical details

We will refer to the replicated GlusterFS volume used for storing configuration
as the Management volume (MV). The following section describes how MV would be
managed.


###MV management

To begin with, we can restrict the MV to a pure replicated volume with a maximum
of 3 bricks on 3 different nodes[3]. The brick path can be stored in glusterd.vol,
which is packaged. The MV will come into existence only after the first peer probe
or the first volume create operation.

The following example of setting up a GlusterFS storage cluster highlights how
things would work in the proposed scheme; a command-level sketch follows the list.

- Install glusterfs server packages on a storage node.

- Start glusterd service.

- Create a volume. --> Now, the MV is created with one brick and mounted under
   /var/lib/glusterd.

- Add a peer to the cluster --> Now, MV is expanded to a 2-way replicated
   volume with the second brick in the new peer. MV is mounted in the new peer
   under /var/lib/glusterd.

- Create more volumes.

- Add the third peer to the cluster --> MV is expanded to a 3-way replicated
   volume with the third brick in the new peer. MV is mounted under
   /var/lib/glusterd in the new peer. This is the last time MV is expanded.

- Any further peers added to the cluster would only mount the MV under
   /var/lib/glusterd.
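
As a rough illustration of the walkthrough above, the operator-visible commands
are the ordinary gluster CLI ones; under this proposal the MV creation and
expansion would happen implicitly inside glusterd. Host and brick names below
are hypothetical.

         # on node1, after installing the packages and starting glusterd
         gluster volume create vol1 node1:/bricks/vol1   # MV created with 1 brick, mounted at /var/lib/glusterd
         gluster volume start vol1
         gluster peer probe node2    # MV expanded to a 2-way replica; node2 mounts it
         gluster peer probe node3    # MV expanded to a 3-way replica; last expansion
         gluster peer probe node4    # node4 only mounts the MV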

The above restrictions placed on MV allow us to escape the need for a robust distributed
store for MV's volume information and volume files.

###Configuration details of MV
- Peers that host bricks for the MV would have a boolean option in glusterd.vol
(a sample glusterd.vol combining these options is sketched after this list).
For e.g., something like,
         option mv_host on

- The brick path for the MV would have a default from the packaged glusterd.vol.
For e.g.,
         option mv_brick /mv/brick

- The replica count could be stored as part of glusterd.vol too.
For e.g.,
         option mv_replica 3

- The ports for MV bricks could be reserved by glusterd's port mapper. For
   e.g., 49152 could be reserved for the MV brick on each node, given that we would
   have only one MV brick per peer.

- Options to be set on the volume: client-quorum and, optionally, proactive self-heal.

- MV would benefit from client-side quorum, server-side quorum and other
   options. These could be preset (packaged) as part of glusterd.vol too.

- With the brick path, ports and volume options present in glusterd.vol or preset,
   we can build the in-memory volume-info representation when glusterd initializes.
   This means we can generate the MV's volume file dynamically on each MV-hosting
   peer when needed and store it in a 'known' location on local disk.
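
Putting these options together, a glusterd.vol on an MV-hosting peer might look
roughly as follows. This is only a sketch: the mv_* option names are the ones
proposed above, and the surrounding management volume block mirrors the packaged
glusterd.vol.

         volume management
             type mgmt/glusterd
             option working-directory /var/lib/glusterd
             option transport-type socket
             # proposed MV settings (names as per this proposal)
             option mv_host on
             option mv_brick /mv/brick
             option mv_replica 3
         end-volume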


###Changes in glusterd command execution

Each peer currently modifies its configuration in /var/lib/glusterd in the commit
phase of every command execution. With the introduction of the MV, only the peer
on which the command is executed will modify the configuration in
/var/lib/glusterd, and it does so after the commit phase has completed on the
remaining available peers. Note that the other nodes don't perform any updates to
the MV.


###How to replace a 'dead' server/peer?

At the moment, I haven't thought of an automatic (or even semi-automatic) way
of replacing a 'dead' peer. The manual steps should be as follows (a
command-level sketch follows the list),

- If the 'dead' peer doesn't host MV bricks, then the procedure is the same as in
   previous versions; this proposal doesn't change anything there.

- Provision a new server. Install glusterfs packages.

- Modify the glusterd.vol to have
         option mv_host on
         option mv_replica 3 #as the case may be

- Probe the peer into the cluster. glusterd, on initialization, would replace its
   MV brick in the MV, and replication's self-heal should replicate the configuration.

N.B. This procedure assumes default MV config parameters. For a non-default configuration,
the brick path should also be updated in glusterd.vol on the new peer.
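
A rough sketch of the replacement, assuming the default MV brick path and a
hypothetical replacement host called newnode:

         # on newnode
         yum install glusterfs-server   # or the distribution's equivalent
         mkdir -p /mv/brick
         # add the mv_host/mv_replica options to /etc/glusterfs/glusterd.vol as shown above
         systemctl start glusterd

         # from any surviving peer
         gluster peer probe newnode     # newnode's glusterd replaces the dead MV brick; self-heal copies the configuration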


###How to upgrade from current version?

The following would be the steps (a command-level sketch follows the list):

- Stop all gluster{d,fs,fsd} processes by stopping the corresponding services.

- Upgrade to this version of glusterfs packages.

- Choose at most 3 servers/peers to build the MV. On these nodes, create the
   default brick directories and modify the (new) glusterd.vol to have
         option mv_host on
   Set the replica count in each peer's glusterd.vol:
         option mv_replica 3 #say

- Move the /var/lib/glusterd contents on each peer to a temporary directory,
   say /var/lib/glusterd.bkp.

- Start the glusterd service on one of the nodes, in 'upgrade' mode. In this mode,
   glusterd would start the MV bricks and mount the MV on /var/lib/glusterd. It will
   not serve CLI or mount requests.

- Copy the contents of /var/lib/glusterd.bkp onto (the mounted)
   /var/lib/glusterd.

- Repeat this on all nodes in the cluster.

- Stop glusterd on all nodes. Start glusterd service on all nodes (in 'normal'
   mode).

- Now the storage cluster should be ready for improved operations.
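
For concreteness, a per-node command sketch of the above might look like the
following. The invocation for the proposed 'upgrade' mode is not specified in
this proposal, so the flag below is only a placeholder; package and service
names vary by distribution.

         systemctl stop glusterd glusterfsd    # stop all gluster{d,fs,fsd} processes
         yum update glusterfs-server           # or the distribution's equivalent
         mv /var/lib/glusterd /var/lib/glusterd.bkp && mkdir /var/lib/glusterd
         # on the (up to 3) MV-hosting nodes, edit /etc/glusterfs/glusterd.vol as described above
         glusterd --mv-upgrade                 # placeholder: start glusterd in the proposed 'upgrade' mode
         cp -a /var/lib/glusterd.bkp/. /var/lib/glusterd/
         # once all nodes are done:
         systemctl stop glusterd
         systemctl start glusterd              # normal (non-upgrade) mode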


###How to upgrade from this version to future versions?

This is trickier than it should be, given that we are holding the MV's configuration
in glusterd.vol, which is packaged. I would like to hear suggestions from the
community on this.


###References
[1] - http://www.gluster.org/community/documentation/index.php/Features/thousand-node-glusterd.

[2] - This approach was initially recommended by Jeff Darcy, who is also the
author of [1].

[3] - It shouldn't be hard to allow expanding the MV beyond 3 bricks, but most distributed configuration
       stores recommend 3- or 5-way replication. At the least, this could be made configurable
       via glusterd.vol.
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
