On 09/25/2011 03:56 AM, Di Pe wrote:

> So far the discussion has been focusing on XFS vs ZFS. I admit that I
> am a fan of ZFS and I have only used XFS for performance reasons on
> mysql servers where it did well. When I read something like this
> http://oss.sgi.com/archives/xfs/2011-08/msg00320.html that makes me
> not want to use XFS for big data. You can assume that this is a real

This is a corner case bug, and one for which we hope to get more data
to the XFS team. They asked for specific information that we couldn't
provide (as we had to fix the problem). Note: other file systems which
allow for sparse files *may* have similar issues. We haven't tried yet.

The issues with ZFS on Linux have to do with legal hazards. Neither
Oracle, nor those who claim ZFS violates their patents, would be happy
to see license violations, or further deployment of ZFS on Linux. I
know the national labs in the US are happily doing the integration from
source, but I don't think Oracle and the patent holders would sit idly
by while others do this. So you'd need to use a ZFS-based system such
as Solaris 11 Express to be able to use it without hassle. BSD and
Illumos may work without issue as well, and should be somewhat better
on the legal front than Linux + ZFS. I am obviously not a lawyer, and
you should consult one before you proceed down this route.

> recent bug because Joe is a smart guy who knows exactly what he is
> doing. Joe and the Gluster guys are vendors who can work around these
> issues and provide support. If XFS is the choice, maybe you should
> hire them for this gig.
>
> ZFS typically does not have these FS repair issues in the first place.
> The motivation of Lawrence Livermore for porting ZFS to Linux was
> quite clear:
>
> http://zfsonlinux.org/docs/SC10_BoF_ZFS_on_Linux_for_Lustre.pdf
>
> OK, they have 50PB and we are talking about much smaller deployments.
> However, some of the limitations they report I can confirm. Also,
> recovering from a drive failure with this whole LVM/Linux RAID stuff
> is unpredictable. Hot swapping does not always work, and if you
> prioritize the re-sync of data to the new drive you can strangle the
> entire box (by default the priority of the re-sync process is low on
> Linux). If you are a Linux expert you can handle this kind of stuff
> (or hire someone), but if you ever want to give this setup to a
> storage administrator, you had better give them something they can
> use with confidence (maybe less of an issue in the cloud).
> Compare this to ZFS: re-silvering works with a very predictable
> result and timing. There is a ton of info out there on this topic. I
> think that Gluster users may be getting around many of the Linux RAID
> issues by simply taking the entire node down (which is OK in
> mirrored-node settings) or by using hardware RAID controllers (which
> are often not available in the cloud).

There are definite advantages to better technology (on the re-sync
priority point specifically, see the sketch below). But the issue in
this case is the legal baggage that goes along with it. BTRFS may,
eventually, be a better choice.

The national labs can do this with something of an immunity to
prosecution for license violation, by claiming the work is part of a
research project and won't actively be used in a way that would harm
Oracle's interests. And it would be ... bad ... for Oracle (and
others) to sue the government over a relatively trivial violation.

Until Oracle comes out with an absolute declaration that it's OK to
use ZFS with Linux in a commercial setting ... yeah ... most vendors
are gonna stay away from that scenario.
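Coming back to the re-sync priority point above: on Linux md, the
rebuild bandwidth floor and ceiling are exposed as tunables under
/proc/sys/dev/raid/. A minimal sketch of bumping the floor so a
rebuild isn't starved (run as root; the 50000 KB/s figure is purely
illustrative, and the right value depends on your hardware):

    #!/usr/bin/env python3
    # Minimal sketch: raise the md re-sync bandwidth floor so a
    # rebuild gets priority over foreground I/O.  Values are KB/s per
    # device; typical kernel defaults are min=1000, max=200000.

    SPEED_MIN = "/proc/sys/dev/raid/speed_limit_min"  # rebuild floor
    SPEED_MAX = "/proc/sys/dev/raid/speed_limit_max"  # rebuild ceiling

    def read_limit(path):
        with open(path) as f:
            return int(f.read().strip())

    def write_limit(path, kb_per_sec):
        with open(path, "w") as f:
            f.write(str(kb_per_sec))

    if __name__ == "__main__":
        print("floor  :", read_limit(SPEED_MIN))
        print("ceiling:", read_limit(SPEED_MAX))
        # Illustrative value only: raise the floor from the usual
        # 1000 KB/s so the rebuild finishes in predictable time.
        write_limit(SPEED_MIN, 50000)

You can watch the effect in /proc/mdstat while the array rebuilds.
Which is rather the point: you have to know these knobs exist, whereas
ZFS just resilvers.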
> Some in the Linux community seem to be slightly opposed to ZFS (I
> assume because of the licensing issue) and make sometimes odd
> suggestions ("You should use BTRFS").

Licensing, mainly. BTRFS has a better design, but it's not ready yet,
and won't be for a while.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615