On 06/29/2010 06:23 AM, Emmanuel Noobadmin wrote:
> I've been trying to find a solution to achieve the following
> objective: minimum delay redundant network storage for virtualized
> server and think gluster might be what I need after throwing out
> options like Lustre, dmraid on Openfiler etc.
>
> The configuration in mind is currently this.
>
> Application Servers
> -> Runs a few VM guest OS
> -> Runs gluster client/server
> -> VM machine images then stored on mirrored gluster volumes.
>
> There will be two storage servers with physical RAID.
>
> The concept is that
> 1. physical RAID 1 catches single disk failure on the storage server
> 2. gluster mirror on the application server catches single machine
> failure of the storage servers
>
> . . .
>
> Would this work or am I missing something?

Not only would this work, but my impression is that it's a pretty
common use of GlusterFS.  The one thing I'd add is that, if you
already have or ever might have more than two application servers,
you use cluster/nufa as well as cluster/replicate (which is what
volgen's "RAID 1" actually does).  The basic idea here is to set up
two (or more) subvolumes on each server, like this:

  srv0vol0    srv1vol0    srv2vol0
  srv0vol1    srv1vol1    srv2vol1

Then you replicate "diagonally" with read-subvolume pointing to the
top row:

  volume rep0
    type cluster/replicate
    option read-subvolume srv0vol0
    subvolumes srv0vol0 srv1vol1
  end-volume

  volume rep1
    type cluster/replicate
    option read-subvolume srv1vol0
    subvolumes srv1vol0 srv2vol1
  end-volume

  volume rep2
    type cluster/replicate
    option read-subvolume srv2vol0
    subvolumes srv2vol0 srv0vol1
  end-volume

Lastly, you apply NUFA with "local-volume-name" on each node pointing
to the replicated volume with its read-subvolume on the same machine.
So, on node 1:

  volume my_nufa
    type cluster/nufa
    option local-volume-name rep1
    subvolumes rep0 rep1 rep2
  end-volume

With this type of configuration, files created on node 1 will be
written to srv1vol0/srv2vol1 and read from srv1vol0.  Note that you
don't need separate disks or anything to set up multiple volumes on a
node; they can just be different directories, though if they're
directories within the same local filesystem then "df" on the
GlusterFS filesystem can be misleading.  Extending the approach from
three servers to any N should be pretty obvious, and you can do the
same thing with cluster/distribute instead of cluster/nufa (they
actually use the same code) if strong locality is not a requirement.

> 3. Avoids any problems caused by a heartbeat/failover delay.
> 4. If an application server die, the VM images are still on the
> gluster volumes, I can simply distribute the downed VMs to the other
> running application server by loading the images.

You're not really getting rid of heartbeat/failover delay, so much as
relying on functionally equivalent behavior within GlusterFS.  Also,
you'll still need some sort of heartbeat to detect that an
application server has died.  Putting your images on GlusterFS makes
it possible for guests on multiple machines to access them, but it's
still a bad idea for them to do so simultaneously.
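Going back to the volfile sketch for a moment: the rep* blocks above
assume the srv*vol* names are already defined earlier in the same
client volfile as protocol/client subvolumes pointing at bricks
exported from each server.  A rough sketch of what that might look
like -- the hostnames (server0..server2), brick name (vol0), and so on
are placeholders for whatever your setup actually uses, not volgen
output:

  # Client side: one protocol/client block per remote brick.
  # Names and hosts below are illustrative only.
  volume srv0vol0
    type protocol/client
    option transport-type tcp
    option remote-host server0
    option remote-subvolume vol0
  end-volume

  # ...repeated for srv0vol1, srv1vol0, srv1vol1, srv2vol0 and
  # srv2vol1, adjusting remote-host and remote-subvolume each time.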
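On the server side, each node would export its two directories as
bricks, along these lines (again just a sketch with made-up names and
paths; you'd want to lock down the auth rule rather than allow *, and
repeat the posix/locks pair for vol1):

  # Server side: one posix brick per exported directory.
  volume posix-vol0
    type storage/posix
    # placeholder path
    option directory /data/vol0
  end-volume

  volume vol0
    type features/locks
    subvolumes posix-vol0
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.vol0.allow *
    subvolumes vol0
  end-volume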
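And if strong locality really isn't a requirement and you go with
cluster/distribute as mentioned above, the top-level block is the same
minus the NUFA-specific option, something like (the name my_dht is
just a placeholder):

  volume my_dht
    type cluster/distribute
    subvolumes rep0 rep1 rep2
  end-volume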