Hi Michal,

Really nice work on the ZFS testing. I've been thinking about this myself from time to time, but I wasn't sure if ZoL was ready to use in production with Ceph.

Instead of running multiple OSDs per node in ZFS/Ceph, I would like to see something like a RAID-Z2 of, say, 8-12 3-4TB spinners per OSD, leveraging some nice SSDs (maybe a P3700 400GB) for the ZIL/L2ARC, with compression enabled and a return to 2x replicas - that could give us some pretty fast/safe/efficient storage. Now to find that money tree.

Regards,
Quenten Grasso

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Michal Kozanecki
Sent: Friday, 10 April 2015 5:15 AM
To: Christian Balzer; ceph-users
Subject: Re: use ZFS for OSDs

I had surgery and have been off for a while, and I had to rebuild the test Ceph+OpenStack cluster with whatever spare parts I had. I apologize for the delay to anyone who's been interested. Here are the results:

==================================================
Hardware/Software

3 node CEPH cluster, 3 OSDs (one OSD per node)
----------------------------------------------
CPU = 1x E5-2670 v1
RAM = 8GB
OS Disk = 500GB SATA
OSD = 900GB 10k SAS (sdc - whole device)
Journal = Shared Intel SSD DC3500 80GB (sdb1 - 10GB partition)
ZFS log = Shared Intel SSD DC3500 80GB (sdb2 - 4GB partition)
ZFS L2ARC = Intel SSD 320 40GB (sdd - whole device)
---------
ceph 0.87
ZoL 0.6.3
CentOS 7.0

2 node KVM/OpenStack cluster
----------------------------
CPU = 2x Xeon X5650
RAM = 24 GB
OS Disk = 500GB SATA
-------------
Ubuntu 14.04
OpenStack Juno

The rough performance of this oddball-sized test ceph cluster is 1000-1500 IOPS at 8k.

==================================================
Compression; (unneeded details cut out)

Various Debian and CentOS images, with lots of test SVN and GIT data, under KVM/OpenStack:

[root@ceph03 ~]# zfs get all SAS1
NAME  PROPERTY          VALUE    SOURCE
SAS1  used              586G     -
SAS1  compressratio     1.50x    -
SAS1  recordsize        32K      local
SAS1  checksum          on       default
SAS1  compression       lz4      local
SAS1  refcompressratio  1.50x    -
SAS1  written           586G     -
SAS1  logicalused       877G     -

==================================================
Dedupe; (dedupe is enabled at the dataset level, but the space savings can only be viewed at the pool level - a bit odd, I know)

Various Debian and CentOS images, with lots of test SVN and GIT data, under KVM/OpenStack:

[root@ceph01 ~]# zpool get all SAS1
NAME  PROPERTY    VALUE  SOURCE
SAS1  size        836G   -
SAS1  capacity    70%    -
SAS1  dedupratio  1.02x  -
SAS1  free        250G   -
SAS1  allocated   586G   -

==================================================
Bitrot/Corruption;

Injected random data at random locations on sdc (changing the seek value each run) with:

dd if=/dev/urandom of=/dev/sdc seek=54356 bs=4k count=1

Results;

1. ZFS detects the on-disk error affecting PG files; since this is a single-disk vdev (no zraid or mirror) it cannot automatically fix it. It blocks all access to the affected files except deletion (they become inaccessible).

*note: I ran this status after already repairing 2 PGs (5.15 and 5.25); zpool status no longer lists a filename once it has been repaired/deleted/cleared*

--------
[root@ceph01 ~]# zpool status -v
  pool: SAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Apr 9 13:04:54 2015
        153G scanned out of 586G at 40.3M/s, 3h3m to go
        0 repaired, 26.05% done
config:

        NAME        STATE     READ WRITE CKSUM
        SAS1        ONLINE       0     0    35
          sdc       ONLINE       0     0    70
        logs
          sdb2      ONLINE       0     0     0
        cache
          sdd       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /SAS1/current/5.e_head/DIR_E/DIR_0/DIR_6/rbd\udata.2ba762ae8944a.00000000000024cc__head_6153260E__5
--------
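As a side note, the path ZFS reports maps straight back to a Ceph placement group under the default FileStore layout - the "5.e_head" directory component above means the damaged object belongs to PG 5.e. A rough, untested one-liner to pull the affected PG ids out of zpool status (illustrative only, it just assumes the path layout shown above):

--------
# List the PGs touched by files that zpool status flags as permanently errored.
# Assumes the FileStore path layout shown above, where the PG id precedes "_head".
zpool status -v SAS1 | awk -F/ '/_head\// {sub(/_head$/, "", $4); print $4}' | sort -u
--------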
2. CEPH-OSD cannot read the PG file and kicks off a scrub/deep-scrub.

--------
/var/log/ceph/ceph-osd.2.log
2015-04-09 13:10:18.319312 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.00000000000018ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:11:38.587014 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 deep-scrub 0 missing, 1 inconsistent objects
2015-04-09 13:11:38.587020 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 deep-scrub 1 errors

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:11:43.640499 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.0000000000005348/head//5 candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:12:44.781546 7fe10abc4700 -1 log_channel(default) log [ERR] : 5.25 deep-scrub 0 missing, 1 inconsistent objects
2015-04-09 13:12:44.781553 7fe10abc4700 -1 log_channel(default) log [ERR] : 5.25 deep-scrub 1 errors
--------

3. CEPH STATUS reports an error.

--------
[root@client01 ~]# ceph status
    cluster e93ce4d3-3a46-4082-9ec5-e23c82ca616e
     health HEALTH_WARN 2 pgs inconsistent; 2 scrub errors; noout flag(s) set
     monmap e2: 3 mons at {ceph01=10.10.10.101:6789/0,ceph02=10.10.10.102:6789/0,ceph03=10.10.10.103:6789/0}, election epoch 146, quorum 0,1,2 ceph01,ceph02,ceph03
     osdmap e3178: 3 osds: 3 up, 3 in
            flags noout
      pgmap v890949: 392 pgs, 6 pools, 931 GB data, 249 kobjects
            1756 GB used, 704 GB / 2460 GB avail
                   2 active+clean+inconsistent
                 391 active+clean
  client io 0 B/s rd, 7920 B/s wr, 3 op/s
--------
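If you don't want to dig through the OSD logs, the inconsistent PGs can also be listed straight from the monitor - an untested one-liner sketched against the health detail output shown below:

--------
# Print the ids of PGs currently flagged inconsistent (field positions match
# the "pg X.Y is active+clean+inconsistent, ..." lines in ceph health detail).
ceph health detail | awk '/active\+clean\+inconsistent/ {print $2}'
--------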
4. Repair must be manually kicked off.

--------
[root@client01 ~]# ceph pg repair 5.18
instructing pg 5.18 on osd.0 to repair

[root@client01 ~]# ceph health detail
HEALTH_WARN 1 pgs repair; noout flag(s) set
pg 5.25 is active+clean+inconsistent, acting [1,0,2]
pg 5.18 is active+clean+scrubbing+deep+repair, acting [2,0,1]

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:30:01.609756 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.00000000000018ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:30:41.834465 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 0 missing, 1 inconsistent objects
2015-04-09 13:30:41.834479 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 1 errors, 1 fixed

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:30:47.952742 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.0000000000005348/head//5 candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:31:23.389095 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 0 missing, 1 inconsistent objects
2015-04-09 13:31:23.389112 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 1 errors, 1 fixed
--------

==================================================
Conclusion;

ZFS compression works GREAT - between 30-50% compression depending on the data (I was getting around 30-35% with only OS images; once I loaded on real test data (SVN/GIT/etc.) this increased to 50%).

ZFS dedupe doesn't seem to get you much, at least with how CEPH stores its data. Maybe due to my recordsize (32K)?

ZFS/CEPH bitrot/corruption protection isn't fully automated, but it's still pretty damn good in my opinion - an improvement over other filesystems, where bitrot is either silent or, if CEPH somehow detects an error, repaired by "coin tossing" between replicas. Here, CEPH attempts to access the file, ZFS detects the error and basically kills access to the file, and CEPH treats this as a read error and kicks off a scrub on the PG. PG repair does not seem to happen automatically, but when kicked off manually it succeeds.

Let me know if there's anything else or any questions people have while I have this test cluster running.

Cheers,

Michal Kozanecki | Linux Administrator | mkozanecki@xxxxxxxxxx

-----Original Message-----
From: Christian Balzer [mailto:chibi@xxxxxxx]
Sent: November-01-14 4:43 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: use ZFS for OSDs

On Fri, 31 Oct 2014 16:32:49 +0000 Michal Kozanecki wrote:

> I'll test this by manually inducing corrupted data to the ZFS
> filesystem and report back how ZFS+ceph interact during a detected
> file failure/corruption, how it recovers and any manual steps
> required, and report back with the results.
>
Looking forward to that.

> As for compression, using lz4 the CPU impact is around 5-20% depending
> on load, type of I/O and I/O size, with little-to-no I/O performance
> impact, and in fact in some cases the I/O performance actually
> increases. I'm currently looking at a compression ratio on the ZFS
> datasets of around 30-35% for data consisting of rbd-backed
> OpenStack KVM VMs.

I'm looking at a similar deployment (VM images), and over 30% compression would at least offset ZFS's need for at least 20% free space (it suffers massive degradation otherwise).

CPU usage looks acceptable; however, in combination with SSD-backed OSDs it's another thing to consider. As in, is it worth spending X amount of money on faster CPUs for 10-20% space savings, or will another SSD be cheaper?
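Incidentally, if you want a feel for what dedup would buy before paying the DDT memory cost, zdb can simulate it against an existing pool - a sketch only, and note that it walks the whole pool, so it takes a while and uses a fair amount of RAM itself:

--------
# Simulate deduplication on the pool (read-only) and print a DDT histogram
# plus an estimated dedup ratio, without actually enabling dedup.
zdb -S SAS1
--------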
I'm trying to position Ceph against SolidFire, who are claiming 4-10 times data reduction through a combination of compression, deduping and thin provisioning - without, of course, quantifying things like which step gives what reduction based on what sample data.

> I have not tried any sort of dedupe as it is memory intensive and I
> only had 24GB of ram on each node. I'll grab some FIO benchmarks and
> report back.
>
I foresee a massive failure here, despite the huge potential of one use case here where all VMs are basically identical (KSM is very effective with those, too).

Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. That will make a big dent, but with many nearly identical VM images we should still have quite a bit of identical data per OSD. However...

2. Data alignment. The RADOS objects making up images are 4MB by default, which - given my limited knowledge of ZFS - I presume will be mapped to 128KB ZFS blocks that are then subject to the deduping process. However, even if one were to install the same OS on identically sized RBD images, I predict subtle differences in alignment within those objects and thus within the ZFS blocks. That becomes a near certainty once those images (OS installs) are customized, files are added or deleted, etc.

3. ZFS block size and VM FS metadata. Even if all the data were perfectly, identically aligned within the 4MB RADOS objects, the resulting 128KB ZFS blocks are likely to contain metadata such as inodes (creation times), making them subtly different and not eligible for deduping.

OTOH, SolidFire claims to be doing global deduplication; how they do that efficiently is a bit beyond me, especially given the memory sizes of their appliances. My guess is they keep a map on disk (all SSDs) on each node instead of keeping it in RAM. I suppose the updates (writes to the SSDs) of this map are still substantially less than the data otherwise written without deduping.

Thus I think Ceph will need a similar approach for any deduping to work, in combination with a much finer-grained "block size". The latter, I believe, is already being discussed in the context of cache tier pools - having to promote/demote 4MB blobs for a single hot 4KB of data is hardly efficient.

Regards,

Christian

> Cheers,
>
>
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: October-30-14 4:12 AM
> To: ceph-users
> Cc: Michal Kozanecki
> Subject: Re: use ZFS for OSDs
>
> On Wed, 29 Oct 2014 15:32:57 +0000 Michal Kozanecki wrote:
>
> > [snip]
> > With Ceph handling the
> > redundancy at the OSD level I saw no need for using ZFS mirroring or
> > zraid, instead if ZFS detects corruption instead of self-healing it
> > sends a read failure of the pg file to ceph, and then ceph's scrub
> > mechanisms should then repair/replace the pg file using a good
> > replica elsewhere on the cluster. ZFS + ceph are a beautiful bitrot
> > fighting match!
> >
> Could you elaborate on that?
> AFAIK Ceph currently has no way to determine which of the replicas is
> "good"; one such failed PG object will require you to do a manual
> repair after the scrub and hope that the two surviving replicas
> (assuming a size of 3) are identical. If not, start tossing a coin.
> Ideally Ceph would have a way to know what happened (as in, it's a
> checksum and not a real I/O error) and do a rebuild of that object itself.
>
> On another note, have you done any tests using the ZFS compression?
> I'm wondering what the performance impact and efficiency are.
>
> Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com