On 28 October 2013 23:13, Tim van Elteren <timvanelteren at gmail.com> wrote:

Background: VFX and Post Production studio that relies on Gluster for all of its production data storage. Data is mostly in the shape of image sequences (raw 2K and 4K footage stored in individual frames, clocking in at 5-100MB per frame, to be played back at 24 or 48 FPS). We create content from a render farm consisting of many Linux boxes spewing out these images as fast as they can.

> 1) transparency to the processing layer / application with respect to data
> locality, e.g. know where data is physically located on a server level,
> mainly for resource allocation and fast processing, high performance, how
> can this be accomplished using GlusterFS?

We mount local GlusterFS clusters (we have them at each site) under the same mount point. We simply rsync chunks of the tree between sites as required on a job-by-job basis (we have "pipeline tools" we write to make this easier for end users, so they call a wrapper script, and not rsync directly). There are remote mount points for users to mount a remote site over our WAN and browse it without needing to rsync big chunks of data if they prefer. This isn't GlusterFS's problem, per se, but just a regular remote file system problem. With that said, GlusterFS's geo-replication may assist you if you want to automate it more.
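For what it's worth, a minimal sketch of the kind of wrapper we put in front of rsync is below. The site names, mount points and rsync flags here are invented purely for illustration; the real pipeline tools do far more validation, locking and logging than this.

#!/usr/bin/env python
# Rough sketch of a site-to-site sync wrapper around rsync.
# Site names and paths are invented for illustration only.
import subprocess
import sys

# Each site has its own local GlusterFS cluster mounted under the same
# path (/prod in this sketch), plus an ssh-reachable file server.
SITES = {
    "siteA": "sitea-fileserver:/prod",
    "siteB": "siteb-fileserver:/prod",
}

def pull_job(job_path, src_site):
    """Pull one job's tree from a remote site onto the local /prod mount."""
    src = "%s/%s/" % (SITES[src_site], job_path)
    dst = "/prod/%s/" % job_path
    # -a preserves the tree, --partial lets big frames resume after a drop.
    return subprocess.call(["rsync", "-a", "--partial", "--progress", src, dst])

if __name__ == "__main__":
    # Usage (hypothetical): pull_job.py jobs/show01/shot010 siteA
    sys.exit(pull_job(sys.argv[1], sys.argv[2]))

End users only ever see the wrapper, so we can change how the transfer actually happens without retraining anyone.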
> 2) posix compliance, or conformance: hadoop for example isn't posix
> compliant by design, what are the pros and cons? What is GlusterFS's
> approach with respect to support for posix operations?

We run a setup of 70% Linux, 20% MacOSX and 10% Windows, all of which require regular file semantics (not object storage). The 90% of our client base that requires a POSIX-compliant file system has no problems with it. We re-export all shares via Samba3 for Windows clients, but we are in the process of totally removing Windows from our setup for a variety of reasons (mostly because it isn't POSIX compliant, we need to support a whole new set of tools to deal with it, and it sucks for our industry).

Anything we need to do on a regular EXT3/EXT4 file system, we can do on GlusterFS (symlinks, hard links, tools like "rsnapshot", etc). Users don't see any difference between it and any regular NFS-mounted NAS type device.

> 3) mainly with respect to evaluating the production readiness of GlusterFS,
> where is it currently used in production environments and for what specific
> use cases it seems most suitable? Are there any known issues / common
> pitfalls and workarounds available?

We've been running 3.3.1 since March this year, and I'm upgrading to 3.4.1 over the coming weeks. I've had outages, but it's never been Gluster's fault. I initially put AFP for our Macs on our Gluster nodes, which was a mistake, and it has since been removed (MacOSX Finder is too slow over SMB due to constant resource fork negative lookups, and we've finally figured out how to get Macs to NFS mount from Gluster without locking up Finder anyway, so we're migrating away from AFP). Likewise, we've had some hardware and network switch failures that have taken out multiple nodes. Recovery was quick enough, though (I think 30 minutes was the worst outage, but again not Gluster's fault).

Because we use a distribute+replicate setup, our single-threaded write speeds aren't amazing. But Gluster's power is in the cluster itself, not any one write thread, so over the course of millions of rendered frames, the performance is better than any single NAS device could give us under the same client load.

Some applications do silly things - we have a handful of apps where the "Open File" dialog insists on doing background reads of whatever tree it's browsing, making file browsing excruciatingly slow. But again, these are rare and typically only on Windows, so we'll be leaving that behind soon.

The other drama we had was that a cpio write operation from one of our production apps was very slow on GlusterFS (GlusterFS doesn't seem to like anything that requests only a portion of a file), so we wrote a wrapper script to save to a local tmpfs, and then copy that back to GlusterFS (a rough sketch of that wrapper is at the end of this mail). That was only for one operation out of thousands, though, and it was easy enough to solve (and it gave us the ability to extend version control into the wrapper script, which we'd wanted to do anyway).

> 4) It seems Gluster has the advantage in Geo replication versus for example
> Ceph. What are the main advantages here?

We don't use this, so I can't comment. If I required my remote and local trees to be in sync, I'd definitely use it. But end users here are happy to transfer data only when required (because even with 800TB online, space is still precious).

> 5) Finally what would be the most compelling reason to go for Gluster and
> not for the alternatives?

For us, we needed simple file semantics (we specifically don't need object storage for OpenStack or Hadoop type operations). That gave us 5 options:

1) Continue with our legacy setup of many NAS units.
   Pros: cheap.
   Cons: inflexible to share storage space between departments, single point of failure per share.

2) Buy a NAS or SAN from a vendor.
   Pros: simple, easily expandable.
   Cons: proprietary, expensive, vendor/product lock-in for future upgrades.

3) Proprietary clustered file system (IBM GPFS, etc). Same pros/cons as a SAN, quite frankly.

4) Ceph or Lustre.
   Pros: open source, usual clustered storage benefits, central name space, etc.
   Cons: in-kernel, requires many nodes for a good high-availability setup to start with, needs a few smart people around to keep it running.

5) GlusterFS.
   Pros: open source, low node count for basic rollout (cheaper to start), usual clustered storage benefits, MUCH simpler than Ceph / Lustre.
   Cons: usual clustered storage cons (single-threaded write speed, etc), young-ish technology.

GlusterFS won the day due to simplicity and the ability to start small. It keeps up with our business needs, and lets us pick and choose our hardware (including mixing and matching different specs/vendors of hardware within the one cluster into the future). Additionally, Red Hat's backing of Gluster finally cemented for us that it was worthwhile, and that it wasn't going to be a project that just vanished overnight or suffered from production-breaking changes between releases.

As suggested, you should test for yourself, and not take anyone's word on anything. But for us, those were the decisions we made, and why.

HTH.

--
Dan Mons
R&D SysAdmin
Unbreaker of broken things
Cutting Edge
http://cuttingedge.com.au
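PS: For anyone who hits the same slow-cpio-save issue, the staging wrapper I mentioned boils down to something like the sketch below. The save command and paths are invented for illustration; the real script also ties into our version control.

#!/usr/bin/env python
# Rough sketch of the tmpfs staging trick for the slow cpio-style saves.
# The save command and paths are invented for illustration only.
import os
import shutil
import subprocess
import sys
import tempfile

def staged_save(final_path, save_command):
    """Run the application's save into local tmpfs, then copy to GlusterFS."""
    # /dev/shm is a tmpfs on our Linux boxes, so the app writes to RAM first.
    staging_dir = tempfile.mkdtemp(dir="/dev/shm")
    staged_file = os.path.join(staging_dir, os.path.basename(final_path))
    try:
        # Assumes the save tool takes the output file as its last argument.
        subprocess.check_call(save_command + [staged_file])
        # One big sequential copy back onto the GlusterFS mount.
        shutil.copy2(staged_file, final_path)
    finally:
        shutil.rmtree(staging_dir)

if __name__ == "__main__":
    # Usage (hypothetical): staged_save.py /prod/show01/scene.archive app_save_tool
    staged_save(sys.argv[1], sys.argv[2:])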