question re. current state of art/practice

Just as a small warning, Sheepdog is a long way from being production-ready -
the big problem I can see is that it won't de-allocate blocks once an image is
deleted! It definitely has promise for the future, though.

We set up a 4-node cluster using Gluster with KVM late last year, and it has
been running along quite nicely for us.

To quote myself from a forum post I made a few weeks ago (note that prices are
in Australian dollars):

In short, we built a VM cluster without a traditional SAN based on free
Linux-based software for a comparatively small amount of money. The company
has had some reasonably explosive growth over the past 18 months, so the
pressure has been on to deliver more power to the various business units
without breaking the bank.

We've been running VMware reasonably successfully, so the obvious option was
to just do that again. We called our friendly Dell rep for a quote on the
traditional setup - SAN, some servers, dual switches and so on - and it came
back at around $40,000. That wasn't going to fly, so we needed to get
creative.

The requirements for this particular cluster were fairly modest: an immediate
need for a mix of ~20 moderate-use Windows and Linux VMs, the ability to scale
3-4x in the short term, cheaper is better, and full redundancy is a must.

Gluster lets you create a virtual "SAN" across multiple nodes, either
replicating everything to every node (think RAID 1), distributing each file to
a separate node (loosely, JBOD or RAID 0), or a combination of the two (called
distribute/replicate in Gluster terms). After running it on a few old boxes as
a test, we decided to take the plunge and build the new cluster on it.
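
As a rough illustration, a distribute/replicate layout on four nodes is
created with the gluster CLI something like this (hostnames and brick paths
here are made up, so treat it as a sketch rather than a recipe):

    # Four bricks, replica 2: files are spread across two replica pairs,
    # so every file lives on exactly two nodes
    gluster volume create vmstore replica 2 transport tcp \
        node1:/export/brick1 node2:/export/brick1 \
        node3:/export/brick1 node4:/export/brick1
    gluster volume start vmstore

    # Each KVM host then mounts the volume with the native FUSE client
    mount -t glusterfs localhost:/vmstore /var/lib/libvirt/images

The brick order matters - consecutive bricks (in groups of the replica count)
form the replica pairs.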

The tools of choice are KVM for the actual virtualisation and GlusterFS for
the storage, all built on top of Debian.

The hardware
We're very much a Dell house, so this all lives on Dell R710 servers, each
with:
* 6-core Xeon
* 24GB RAM
* 6x 300GB 15k SAS drives
* Intel X520-DA2 dual-port 10 GigE NIC (SFP+ twinax cables)
* the usual DRAC module, ProSupport etc.

All connected together with a pair of Dell 8024F 10 GigE switches in an
active/backup configuration. All up, the price was in the region of $20,000.
Much better.

Gluster - storage
Gluster is still a bit rough around the edges, but it does what we need
reasonably well. The main problem is that if a node has to resync (self-heal
in Gluster terms), the WHOLE file is locked for reading across all nodes until
the sync is finished. If you have a lot of big VM images, this can mean that
their storage "disappears" as far as KVM is concerned while the sync happens,
causing the VM to hard crash. Even with 10 GigE and fast disks, moving
several-hundred-GB images takes a while. Currently, if a node goes offline, we
leave it dead until we can shut down all of the VMs and bring it back up
gracefully. That has only happened once so far (a hardware problem), but it is
certainly something that worries us.
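
For context, in the 3.2 series a full self-heal is normally kicked off by
walking the volume from a client mount and stat()ing every file, along these
lines (the mount point is illustrative):

    # Touching every file from a client forces replicate to heal it
    find /var/lib/libvirt/images -noleaf -print0 | xargs --null stat > /dev/null

With multi-hundred-GB images, each of those stats can turn into a very long,
file-locking resync.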

The whole-file locking during self-heal is apparently being fixed in the next
major release (3.3), due out in the middle of this year. It is also worth
noting that Gluster was recently acquired by Red Hat, which bodes well for the
project.

We did have a few stability problems with earlier versions, but 3.2.5 has run
smoothly under everything we've thrown at it so far.

KVM - hypervisor
Very powerful, performs better than VMware in our testing, and totally free.
We use libvirt with some custom apps to manage everything, but 99% of it is
set-and-forget. For those who want a nice GUI on Windows, sadly nothing is
100% there yet. For those of us who prefer the command line, virsh from
libvirt works very well.
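
For anyone wondering what that looks like day to day, it's mostly just virsh
against the local libvirt daemon - for example (the domain name is
hypothetical):

    virsh list --all          # all defined VMs and their current state
    virsh start web01         # boot a VM
    virsh shutdown web01      # clean ACPI shutdown
    virsh console web01       # attach to the guest's serial console
    virsh dominfo web01       # CPU, memory and state for one domain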

The only problem is that the concept of live snapshots doesn't currently
exist - the VM has to be paused or shut down to take a snapshot. We've tested
wedging LVM under the storage bricks, which lets us snapshot the underlying
storage and take our backups from there. It seems to work OK on the test
boxes, but it isn't in production yet. Until we can get a long enough window
to set that up, we're doing ad-hoc manual snapshots of the VMs, usually
before/after a major change when we already have an outage window.
Day-to-day data is backed up the traditional way.
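
The LVM-under-the-bricks approach we've been testing looks roughly like this
(VG/LV names, sizes and hosts are made up for illustration - again, a sketch,
not what's running in production):

    # Pause the guest so its image is at least crash-consistent
    virsh suspend web01

    # Snapshot the logical volume holding the Gluster brick
    lvcreate --snapshot --size 20G --name brick1-snap /dev/vg_storage/brick1

    virsh resume web01

    # Mount the snapshot read-only, copy the images off, then clean up
    mount -o ro /dev/vg_storage/brick1-snap /mnt/brick1-snap
    rsync -a /mnt/brick1-snap/ backuphost:/backups/vm-images/
    umount /mnt/brick1-snap
    lvremove -f /dev/vg_storage/brick1-snap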

What else did we look at?
There are a number of other shared storage / cluster storage apps out there
(Ceph, Lustre, DRBD etc), all of which have their own little problems.

There are also a lot of different hypervisors out there (Xen being the main
other player), but KVM fit our needs perfectly.

We looked at doing 4x bonded gigabit ethernet for the storage, but Dell
offered us a deal where stepping up to 10 GigE wasn't much more expensive, and
it meant we could avoid the fun of dealing with bonded links. Realistically,
bonded gigabit would have done the job for us, but the deal we were offered
made it silly to consider. If memory serves, the 8024F units were brand new at
the time, so I think they were trying to get some into the wild.

Conclusion
There are certainly a few rough edges (primarily to do with storage), but
nothing show-stopping for what we need at the moment. In the end, the business
is happy, so we're happy.

I'd definitely say this approach is at least worth a look for anyone building
a small VM cluster. Gluster is still a bit too immature for me to be 100%
happy with it, but I think it's going to be perfect in 6 months' time. At the
moment, it is fine to use in production with a bit of care and knowledge of
its limitations.


-----Original Message-----
From: gluster-users-bounces at gluster.org
[mailto:gluster-users-bounces at gluster.org] On Behalf Of Miles Fidelman
Sent: Friday, 17 February 2012 5:47 AM
Cc: gluster-users at gluster.org
Subject: Re: question re. current state of art/practice

Brian Candler wrote:
> On Wed, Feb 15, 2012 at 08:22:18PM -0500, Miles Fidelman wrote:
>> We've been running a 2-node, high-availability cluster - basically 
>> xen w/ pacemaker and DRBD for replicating disks.  We recently 
>> purchased 2 additional servers, and I'm thinking about combining
>> all 4 machines into a 4-node cluster - which takes us out of DRBD 
>> space and requires some other kind of filesystem replication.
>>
>> Gluster, Ceph, Sheepdog, and XtreemFS seem to keep coming up as 
>> things that might work, but... Sheepdog is too tied to KVM
> ... although if you're considering changing DRBD->Gluster, then 
> changing
> Xen->KVM is perhaps worth considering too?

Considering it, but... Sheepdog doesn't seem to have the support that
Gluster does, and my older servers don't have the processor extensions
necessary to run KVM.  Sigh....

>> i.  Is it now reasonable to consider running Gluster and Xen on the 
>> same boxes, without hitting too much of a performance penalty?
> I have been testing Gluster on 24-disk nodes:
>    - 2 HBAs per node (one 16-port and one 8-port)
>    - single CPU chip (one node is dual-core i3, one is quad-core Xeon)
>    - 8GB RAM
>    - 10G ethernet
> and however I hit it, the CPU is mostly idle. I think the issue for 
> you is more likely to be one of latency rather than throughput or CPU 
> utilisation, and if you have multiple VMs accessing the disk 
> concurrently then latency becomes less important.
>
> However, I should add that I'm not running VMs on top of this, just 
> doing filesystem tests (and mostly reads at this stage).
>
> For what gluster 3.3 will bring to the table, see this:
> http://community.gluster.org/q/can-i-use-glusterfs-as-an-alternative-network-storage-backing-for-vm-hosting/

Thanks!  That gives me some hard info.  I'm starting to think waiting for
3.3 is a very good idea.  Might start playing with the beta.

Miles

--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra


_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users



