On 09/29/2011 01:44 PM, David Miller wrote:
> On Thu, Sep 29, 2011 at 1:32 PM, David Miller <david3d at gmail.com
> <mailto:david3d at gmail.com>> wrote:
>
> Couldn't you accomplish the same thing with flashcache?
> https://github.com/facebook/flashcache/
>
>
> I should expand on that a little bit. Flashcache is a kernel module
> created by Facebook that uses the device mapper interface in Linux to
> provide an SSD cache layer for any block device.
>
> What I think would be interesting is using flashcache with a PCIe SSD as
> the caching device. That would add about $500-$600 to the cost of each
> brick node but should be able to buffer the active IO from the spinning
> media pretty well.

Erp ... low-end PCIe flash with decent performance starts much higher
than $500-$600 USD.

> Something like this.
> http://www.amazon.com/OCZ-Technology-Drive-240GB-Express/dp/B0058RECUE
> or something from FusionIO if you want something that's aimed more at
> the enterprise.

Flashcache is reasonably good, but there are many variables in using it,
and it's designed for a different use case. For most people writeback
mode may be reasonable, but other use cases would require different
configurations.

This said, please understand that flashcache (and L2ARC, and other
similar things) are *not* silver bullets; they are not magical things
that will instantly make something far better at no cost or effort. They
do introduce additional complexity, and additional tuning points.

The thing you cannot get rid of, the network traversal, is implicated in
much of the performance degradation for small files. Putting the file
system on a RAM disk (if that were possible; tmpfs doesn't support the
needed xattrs) wouldn't make the system much faster for small files.
Eliminating the network traversal and doing local distributed caching of
metadata on the client side ... could ... but that would be a huge new
complication, and I'd argue that it probably isn't worth it.

For the near term, small-file performance is going to be bad. You might
be able to play some games to make it better (L2ARC etc. could help in
some aspects, but they won't be universally much better).

What matters most is a very good design on the storage backend (we are
biased, given what it is we sell/support), very good networking, and a
very good Gluster implementation and tuning. It's really easy to hit
very slow performance by missing critical elements. We field many
inquiries that start out with "we built our own and the performance
isn't that good." You won't get good performance out of the cluster file
system if the underlying file system and storage design aren't going to
give it to you in the first place.

This said, please understand that there is a (significant) performance
cost to all those nice features in ZFS, and there is a reason why it is
not generally considered a high-performance file system. So if you start
building with it, you shouldn't necessarily assume that the whole is
going to be faster than the sum of the parts. It might be worse. This is
a caution from someone who has tested and shipped many different file
systems in the past, ZFS included, on Solaris and other machines. There
is a very significant performance penalty one pays for using some of
these features. You have to decide whether that penalty is worth it.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615