All, I mentioned a few weeks ago that we were doing some testing of ZFS & Gluster. We brought up two 4-year-old but still reasonably beefy machines with the following specs:

Hardware:
Supermicro X7DB8 storage server / 64 GB RAM (16 x 4GB 667MHz ECC)
2 x Xeon quad-core E5450 @ 3.00GHz
2 x Marvell MV88SX6081 8-port SATA II PCI-X
1 x OCZ 50GB RevoDrive OCZSSDPX-1RVD0050
1 x Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
16 x 2TB WD SATA consumer-grade drives

Software:
Kernel 2.6.37.6-0.7 (openSUSE 11.4)
ZFSOnLinux 0.6.0-rc5
GlusterFS 3.2.3

machine1 was set up with WD 2TB Caviar Black consumer-grade drives (5-year warranty) and machine2 with the cheapest WD 2TB Caviar Green drives.

Other notes on the hardware:

* We chose Marvell SATA controllers because the Sun Thumpers (the mother of all ZFS devices) also used them, and a local ZFS expert advised against combined SAS/SATA controllers. We chose PCI-X because this older hardware did not have enough PCIe slots.
* The OCZ RevoDrive is on the small side, but it is quite fast and costs only $250.

We created one ZFS tank with raidz on each machine, which resulted in 2 x 25TB of space. The PCIe RevoDrives present 2 x 25GB of flash; we used one half as the ZIL log device and the other as L2ARC cache. (The ZIL device should probably be mirrored in a production system; the L2ARC device can apparently go away without causing FS corruption.) Setting ashift=12 for the green drives was the only change we made from the default ZFS config (there is a lot of documentation on how to make these green drives usable with ZFS). A rough sketch of the corresponding commands appears further down.

Our first impression was that either system memory or the L2ARC provided really good write caching for short sequential writes (a lot of our work is like that); iotop showed 700+ MB/s. We ran bonnie++ a couple of times, but we are no benchmark experts - see the results at the end of this note. We'd be happy to run other benchmarks if someone thinks that would make more sense. It was interesting to see that the green drives reached about 90% of the throughput of the black drives.

Another type of workload in our environment comes from HPC users who create many small files and delete them again shortly afterwards. To simulate this we wrote a silly little Python script that creates lots of 1k files with slightly different content:

*****scratch.py ***************************************************
#!/usr/bin/python
import sys

mystr = """
alskjdhfka;kajf akjhfdskajshf k;ajhsdf;kajhf k;jah
........another 1000 random chars
"""

# create the number of ~1kB files given on the command line,
# each with slightly different content
for i in xrange(int(sys.argv[1])):
    fname = "file-%s" % i
    fh = open(fname, "w")
    fh.write(str(i) + mystr)
    fh.close()
***********************************************************

The script created 10000 files and then 100000 files locally on Machine1. First we ran this on the boot drives (ext4):

create 10000 files:      sub-second
ls -la on 10000 files:   sub-second
rm * on 10000 files:     sub-second
create 100000 files:     4s
ls -la on 100000 files:  1s
rm * on 100000 files:    3s

After clearing the cache we ran the same thing on the ZFS filesystem:

create 10000 files:      3s
ls -la on 10000 files:   sub-second
rm * on 10000 files:     sub-second
create 100000 files:     92s
ls -la on 100000 files:  1s
rm * on 100000 files:    8s

At first sight this is kind of disappointing because ZFS seems to be much slower. However, a ZFS file server is almost never used as local storage in a compute box.
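For reference, the per-machine pool layout described above corresponds roughly to commands like the following. This is only a sketch; the pool name, drive names and RevoDrive partition names are placeholders, not our exact invocation:

  # one raidz vdev over the 16 SATA drives; the two 25GB halves of the
  # RevoDrive become a separate log (ZIL) device and an L2ARC cache device.
  # ashift=12 was only set on machine2 (the 4k-sector green drives).
  zpool create -o ashift=12 tank raidz \
      sdb sdc sdd sde sdf sdg sdh sdi \
      sdj sdk sdl sdm sdn sdo sdp sdq \
      log revo1 cache revo2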
You'll have to measure this via NFS/GlusterFS, and then the story looks slightly different. First we tried the same thing on a fast compute box that is connected to a fast NetApp 3050 with 60 10k RPM FC drives:

create 10000 files:      14s
ls -la on 10000 files:   1s
rm * on 10000 files:     4s

Then we used a Solaris file server (a Dell R810 connected to a 600TB FC SAN that consists mostly of SATA drives) mounted on the same box:

create 10000 files:      32s (NFSv4: 53s)
ls -la on 10000 files:   6s (NFSv4: sub-second)
rm * on 10000 files:     15s (NFSv4: 20s)

Then we used our new 50TB ZFS Gluster system (Machine1 and Machine2, distributed, 2 bricks), mounted via glusterfs on the client:

create 10000 files:      27s
ls -la on 10000 files:   8s
rm * on 10000 files:     13s

I admit we don't have any really fast storage to compare against and more tests need to be done, but measured against our current needs this ZFS/Gluster setup would do quite well next to our existing equipment. We have not had any ZFS crashes so far; everything has worked very stably.

On the throughput front we have not connected these systems to 10G yet, but the numbers indicate that this 2-node Gluster cluster could probably fill an entire 10G pipe. We also wanted to experiment with InfiniBand/RDMA, but the drivers seem to be unsupported in later Linux kernels (e.g. 2.6.37). If anyone has a howto, please let me know.

######## BENCHMARK #####################

Machine1, 16 x WD Caviar Black, 50GB SSD, ashift=9

/loc/bonnie # bonnie++ -d /loc/bonnie -s 22g -r 11g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Machine1        22G           380617  95 332375  92           1015229  99  3380   7
Machine1,22G,,,380617,95,332375,92,,,1015229,99,3380.1,7,,,,,,,,,,,,,

/loc/bonnie # bonnie++ -d /loc/bonnie -s 96g -r 16g -n 0 -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
Machine1        96G           336650  91 197629  70            472625  74 141.4   1
Machine1,96G,,,336650,91,197629,70,,,472625,74,141.4,1,,,,,,,,,,,,,
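For completeness, the 2-brick distributed Gluster volume mentioned above was set up roughly along these lines. Again just a sketch; the volume name, brick paths and hostnames are placeholders rather than our exact commands:

  # on machine1, with glusterd running on both nodes
  gluster peer probe machine2
  gluster volume create tank-dist transport tcp \
      machine1:/tank/brick machine2:/tank/brick
  gluster volume start tank-dist

  # on the client
  mount -t glusterfs machine1:/tank-dist /mnt/tank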