All, I mentioned a few weeks ago that we were doing some testing of ZFS & Gluster. We brought up two 4-year-old but still reasonably beefy machines with the following specs:

Hardware:
Supermicro X7DB8 storage server / 64 GB RAM (16 x 4GB 667MHz ECC)
2 x Xeon quad-core E5450 @ 3.00GHz
2 x Marvell MV88SX6081 8-port SATA II PCI-X
1 x OCZ 50GB RevoDrive OCZSSDPX-1RVD0050
1 x Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
16 x 2TB WD SATA consumer-grade drives

Software:
Kernel 2.6.37.6-0.7 (openSUSE 11.4)
ZFSOnLinux 0.6.0-rc5
GlusterFS 3.2.3

machine1 was set up with WD 2TB Caviar Black consumer-grade drives (5-year warranty) and machine2 with the cheapest WD 2TB Caviar Green drives.

Other notes on the hardware:

* We chose Marvell SATA controllers because the Sun Thumpers (the mother of all ZFS devices) also used them, and a local ZFS expert advised against combined SAS/SATA controllers. We chose PCI-X because this older hardware did not have enough PCIe slots.
* The OCZ RevoDrive is on the small side, but it is quite fast and costs only $250.

We created one ZFS tank with raidz on each machine, which resulted in 2 x 25TB of space. The PCIe RevoDrives present 2 x 25GB of flash; we used one half as the ZIL log device and the other as L2ARC cache. (The ZIL device should probably be mirrored in a production system; the L2ARC device can apparently go away without causing FS corruption.) Setting ashift=12 for the green drives was the only change we made from the default ZFS config (there is a lot of documentation on how to make these green drives usable with ZFS). A rough sketch of the corresponding commands appears further down.

Our first impression was that either system memory or the L2ARC provided really good write caching for short sequential writes (a lot of our work is like that); iotop showed 700+ MB/s. We ran bonnie++ a couple of times, but we are no benchmark experts - see the results at the end of this note. We'd be happy to run other benchmarks if someone thinks that would make more sense. It was interesting to see that the green drives reached about 90% of the throughput of the black drives.

Another type of workload in our environment comes from HPC users who create many small files and delete them again shortly afterwards. To simulate this we wrote a silly little Python script that creates lots of 1k files with slightly different content:

*****scratch.py ***************************************************
#!/usr/bin/python
import sys

mystr = """
alskjdhfka;kajf akjhfdskajshf k;ajhsdf;kajhf k;jah
........another 1000 random chars
"""

# create the number of ~1kB files given on the command line,
# each with slightly different content
for i in xrange(int(sys.argv[1])):
    fname = "file-%s" % i
    fh = open(fname, "w")
    fh.write(str(i) + mystr)
    fh.close()
***********************************************************

The script created 10000 files and then 100000 files locally on Machine1. First we ran this on the boot drives (ext4):

create 10000 files:      sub-second
ls -la on 10000 files:   sub-second
rm * on 10000 files:     sub-second
create 100000 files:     4s
ls -la on 100000 files:  1s
rm * on 100000 files:    3s

After clearing the cache we ran the same thing on the ZFS filesystem:

create 10000 files:      3s
ls -la on 10000 files:   sub-second
rm * on 10000 files:     sub-second
create 100000 files:     92s
ls -la on 100000 files:  1s
rm * on 100000 files:    8s

At first sight this is kind of disappointing because ZFS seems to be much slower. However, a ZFS file server is almost never used as local storage in a compute box.
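For reference, the per-machine pool layout described above corresponds roughly to commands like the following. This is only a sketch; the pool name, drive names and RevoDrive partition names are placeholders, not our exact invocation:

  # one raidz vdev over the 16 SATA drives; the two 25GB halves of the
  # RevoDrive become a separate log (ZIL) device and an L2ARC cache device.
  # ashift=12 was only set on machine2 (the 4k-sector green drives).
  zpool create -o ashift=12 tank raidz \
      sdb sdc sdd sde sdf sdg sdh sdi \
      sdj sdk sdl sdm sdn sdo sdp sdq \
      log revo1 cache revo2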
You'll have to measure this via NFS/GlusterFS, and then the story looks slightly different. First we tried the same thing on a fast compute box that is connected to a fast NetApp 3050 with 60 10k RPM FC drives:

create 10000 files:      14s
ls -la on 10000 files:   1s
rm * on 10000 files:     4s

Then we used a Solaris file server (a Dell R810 connected to a 600TB FC SAN that consists mostly of SATA drives) mounted on the same box:

create 10000 files:      32s (NFSv4: 53s)
ls -la on 10000 files:   6s (NFSv4: sub-second)
rm * on 10000 files:     15s (NFSv4: 20s)

Then we used our new 50TB ZFS Gluster system (Machine1 and Machine2, distributed, 2 bricks), mounted via glusterfs on the client:

create 10000 files:      27s
ls -la on 10000 files:   8s
rm * on 10000 files:     13s

I admit we don't have any really fast storage to compare against and more tests need to be done, but measured against our current needs this ZFS/Gluster setup would do quite well next to our existing equipment. We have not had any ZFS crashes so far; everything has worked very stably.

On the throughput front we have not connected these systems to 10G yet, but the numbers indicate that this 2-node Gluster cluster could probably fill an entire 10G pipe. We also wanted to experiment with InfiniBand/RDMA, but the drivers seem to be unsupported in later Linux kernels (e.g. 2.6.37). If anyone has a howto, please let me know.

######## BENCHMARK #####################

Machine1, 16 x WD Caviar Black, 50GB SSD, ashift=9

/loc/bonnie # bonnie++ -d /loc/bonnie -s 22g -r 11g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Machine1        22G           380617  95 332375  92           1015229  99  3380   7
Machine1,22G,,,380617,95,332375,92,,,1015229,99,3380.1,7,,,,,,,,,,,,,

/loc/bonnie # bonnie++ -d /loc/bonnie -s 96g -r 16g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Machine1        96G           336650  91 197629  70            472625  74 141.4   1
Machine1,96G,,,336650,91,197629,70,,,472625,74,141.4,1,,,,,,,,,,,,,
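For completeness, the 2-brick distributed Gluster volume mentioned above was set up roughly along these lines. Again just a sketch; the volume name, brick paths and hostnames are placeholders rather than our exact commands:

  # on machine1, with glusterd running on both nodes
  gluster peer probe machine2
  gluster volume create tank-dist transport tcp \
      machine1:/tank/brick machine2:/tank/brick
  gluster volume start tank-dist

  # on the client
  mount -t glusterfs machine1:/tank-dist /mnt/tank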