Hi Michal,

Really nice work on the ZFS testing. I've been thinking about this myself from time to time, but I wasn't sure if ZoL was ready to use in production with Ceph.

Instead of running multiple OSDs per node in ZFS/Ceph, I would like to see something like a RAID-Z2 of, say, 8-12 3-4TB spinners per OSD, leveraging some nice SSDs (maybe a P3700 400GB) for the ZIL/L2ARC, with compression enabled and a return to 2x replicas - that could give us some pretty fast/safe/efficient storage. Now to find that money tree.

Regards,
Quenten Grasso

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Michal Kozanecki
Sent: Friday, 10 April 2015 5:15 AM
To: Christian Balzer; ceph-users
Subject: Re: use ZFS for OSDs

I had surgery and have been off for a while, and I had to rebuild the test Ceph+OpenStack cluster with whatever spare parts I had. I apologize for the delay to anyone who's been interested. Here are the results:

==================================================
Hardware/Software

3 node CEPH cluster, 3 OSDs (one OSD per node)
----------------------------------------------
CPU = 1x E5-2670 v1
RAM = 8GB
OS Disk = 500GB SATA
OSD = 900GB 10k SAS (sdc - whole device)
Journal = Shared Intel SSD DC3500 80GB (sdb1 - 10GB partition)
ZFS log = Shared Intel SSD DC3500 80GB (sdb2 - 4GB partition)
ZFS L2ARC = Intel SSD 320 40GB (sdd - whole device)
---------
ceph 0.87
ZoL 0.6.3
CentOS 7.0

2 node KVM/OpenStack cluster
----------------------------
CPU = 2x Xeon X5650
RAM = 24 GB
OS Disk = 500GB SATA
-------------
Ubuntu 14.04
OpenStack Juno

The rough performance of this oddball-sized test ceph cluster is 1000-1500 IOPS at 8k.

==================================================
Compression; (unneeded details cut out)

Various Debian and CentOS images, with lots of test SVN and GIT data, under KVM/OpenStack:

[root@ceph03 ~]# zfs get all SAS1
NAME  PROPERTY          VALUE    SOURCE
SAS1  used              586G     -
SAS1  compressratio     1.50x    -
SAS1  recordsize        32K      local
SAS1  checksum          on       default
SAS1  compression       lz4      local
SAS1  refcompressratio  1.50x    -
SAS1  written           586G     -
SAS1  logicalused       877G     -

==================================================
Dedupe; (dedupe is enabled at the dataset level, but the space savings can only be viewed at the pool level - a bit odd, I know)

Various Debian and CentOS images, with lots of test SVN and GIT data, under KVM/OpenStack:

[root@ceph01 ~]# zpool get all SAS1
NAME  PROPERTY    VALUE  SOURCE
SAS1  size        836G   -
SAS1  capacity    70%    -
SAS1  dedupratio  1.02x  -
SAS1  free        250G   -
SAS1  allocated   586G   -

==================================================
Bitrot/Corruption;

Injected random data at random locations on sdc (changing the seek value each run) with:

dd if=/dev/urandom of=/dev/sdc seek=54356 bs=4k count=1

Results;

1. ZFS detects the on-disk error affecting PG files; since this is a single-disk vdev (no zraid or mirror) it cannot automatically fix it. It blocks all access to the affected files except deletion (they become inaccessible).

*note: I ran this status after already repairing 2 PGs (5.15 and 5.25); zpool status no longer lists a filename once it has been repaired/deleted/cleared*

--------
[root@ceph01 ~]# zpool status -v
  pool: SAS1
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Apr 9 13:04:54 2015
        153G scanned out of 586G at 40.3M/s, 3h3m to go
        0 repaired, 26.05% done
config:

        NAME        STATE     READ WRITE CKSUM
        SAS1        ONLINE       0     0    35
          sdc       ONLINE       0     0    70
        logs
          sdb2      ONLINE       0     0     0
        cache
          sdd       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /SAS1/current/5.e_head/DIR_E/DIR_0/DIR_6/rbd\udata.2ba762ae8944a.00000000000024cc__head_6153260E__5
--------
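As a side note, the path ZFS reports maps straight back to a Ceph placement group under the default FileStore layout - the "5.e_head" directory component above means the damaged object belongs to PG 5.e. A rough, untested one-liner to pull the affected PG ids out of zpool status (illustrative only, it just assumes the path layout shown above):

--------
# List the PGs touched by files that zpool status flags as permanently errored.
# Assumes the FileStore path layout shown above, where the PG id precedes "_head".
zpool status -v SAS1 | awk -F/ '/_head\// {sub(/_head$/, "", $4); print $4}' | sort -u
--------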
2. CEPH-OSD cannot read the PG file and kicks off a scrub/deep-scrub.

--------
/var/log/ceph/ceph-osd.2.log
2015-04-09 13:10:18.319312 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.00000000000018ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:11:38.587014 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 deep-scrub 0 missing, 1 inconsistent objects
2015-04-09 13:11:38.587020 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 deep-scrub 1 errors

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:11:43.640499 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.0000000000005348/head//5 candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:12:44.781546 7fe10abc4700 -1 log_channel(default) log [ERR] : 5.25 deep-scrub 0 missing, 1 inconsistent objects
2015-04-09 13:12:44.781553 7fe10abc4700 -1 log_channel(default) log [ERR] : 5.25 deep-scrub 1 errors
--------

3. CEPH STATUS reports an error.

--------
[root@client01 ~]# ceph status
    cluster e93ce4d3-3a46-4082-9ec5-e23c82ca616e
     health HEALTH_WARN 2 pgs inconsistent; 2 scrub errors; noout flag(s) set
     monmap e2: 3 mons at {ceph01=10.10.10.101:6789/0,ceph02=10.10.10.102:6789/0,ceph03=10.10.10.103:6789/0}, election epoch 146, quorum 0,1,2 ceph01,ceph02,ceph03
     osdmap e3178: 3 osds: 3 up, 3 in
            flags noout
      pgmap v890949: 392 pgs, 6 pools, 931 GB data, 249 kobjects
            1756 GB used, 704 GB / 2460 GB avail
                   2 active+clean+inconsistent
                 391 active+clean
  client io 0 B/s rd, 7920 B/s wr, 3 op/s
--------
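If you don't want to dig through the OSD logs, the inconsistent PGs can also be listed straight from the monitor - an untested one-liner sketched against the health detail output shown below:

--------
# Print the ids of PGs currently flagged inconsistent (field positions match
# the "pg X.Y is active+clean+inconsistent, ..." lines in ceph health detail).
ceph health detail | awk '/active\+clean\+inconsistent/ {print $2}'
--------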
4. Repair must be manually kicked off.

--------
[root@client01 ~]# ceph pg repair 5.18
instructing pg 5.18 on osd.0 to repair

[root@client01 ~]# ceph health detail
HEALTH_WARN 1 pgs repair; noout flag(s) set
pg 5.25 is active+clean+inconsistent, acting [1,0,2]
pg 5.18 is active+clean+scrubbing+deep+repair, acting [2,0,1]

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:30:01.609756 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.00000000000018ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:30:41.834465 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 0 missing, 1 inconsistent objects
2015-04-09 13:30:41.834479 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 1 errors, 1 fixed

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:30:47.952742 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.0000000000005348/head//5 candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:31:23.389095 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 0 missing, 1 inconsistent objects
2015-04-09 13:31:23.389112 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 1 errors, 1 fixed
--------

==================================================
Conclusion;

ZFS compression works GREAT - between 30-50% compression depending on the data (I was getting around 30-35% with only OS images; once I loaded on real test data (SVN/GIT/etc.) this increased to 50%).

ZFS dedupe doesn't seem to get you much, at least with how CEPH stores its data. Maybe due to my recordsize (32K)?

ZFS/CEPH bitrot/corruption protection isn't fully automated, but it's still pretty damn good in my opinion - an improvement over other filesystems, where bitrot is either silent or, if CEPH somehow detects an error, repaired by "coin tossing" between replicas. Here, CEPH attempts to access the file, ZFS detects the error and basically kills access to the file, and CEPH treats this as a read error and kicks off a scrub on the PG. PG repair does not seem to happen automatically, but when kicked off manually it succeeds.

Let me know if there's anything else or any questions people have while I have this test cluster running.

Cheers,

Michal Kozanecki | Linux Administrator | mkozanecki@xxxxxxxxxx

-----Original Message-----
From: Christian Balzer [mailto:chibi@xxxxxxx]
Sent: November-01-14 4:43 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: use ZFS for OSDs

On Fri, 31 Oct 2014 16:32:49 +0000 Michal Kozanecki wrote:

> I'll test this by manually inducing corrupted data to the ZFS
> filesystem and report back how ZFS+ceph interact during a detected
> file failure/corruption, how it recovers and any manual steps
> required, and report back with the results.
>
Looking forward to that.

> As for compression, using lz4 the CPU impact is around 5-20% depending
> on load, type of I/O and I/O size, with little-to-no I/O performance
> impact, and in fact in some cases the I/O performance actually
> increases. I'm currently looking at a compression ratio on the ZFS
> datasets of around 30-35% for data consisting of rbd-backed
> OpenStack KVM VMs.

I'm looking at a similar deployment (VM images), and over 30% compression would at least offset ZFS's need for at least 20% free space (it suffers massive degradation otherwise).

CPU usage looks acceptable; however, in combination with SSD-backed OSDs it's another thing to consider. As in, is it worth spending X amount of money on faster CPUs for 10-20% space savings, or will another SSD be cheaper?
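Incidentally, if you want a feel for what dedup would buy before paying the DDT memory cost, zdb can simulate it against an existing pool - a sketch only, and note that it walks the whole pool, so it takes a while and uses a fair amount of RAM itself:

--------
# Simulate deduplication on the pool (read-only) and print a DDT histogram
# plus an estimated dedup ratio, without actually enabling dedup.
zdb -S SAS1
--------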
I'm trying to position Ceph against SolidFire, who are claiming 4-10 times data reduction through a combination of compression, deduping and thin provisioning - without, of course, quantifying things like which step gives what reduction based on what sample data.

> I have not tried any sort of dedupe as it is memory intensive and I
> only had 24GB of ram on each node. I'll grab some FIO benchmarks and
> report back.
>
I foresee a massive failure here, despite the huge potential of one use case here where all VMs are basically identical (KSM is very effective with those, too).

Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. That will make a big dent, but with many nearly identical VM images we should still have quite a bit of identical data per OSD. However...

2. Data alignment. The RADOS objects making up images are 4MB by default, which - given my limited knowledge of ZFS - I presume will be mapped to 128KB ZFS blocks that are then subject to the deduping process. However, even if one were to install the same OS on identically sized RBD images, I predict subtle differences in alignment within those objects and thus within the ZFS blocks. That becomes a near certainty once those images (OS installs) are customized, files are added or deleted, etc.

3. ZFS block size and VM FS metadata. Even if all the data were perfectly, identically aligned within the 4MB RADOS objects, the resulting 128KB ZFS blocks are likely to contain metadata such as inodes (creation times), making them subtly different and not eligible for deduping.

OTOH, SolidFire claims to be doing global deduplication; how they do that efficiently is a bit beyond me, especially given the memory sizes of their appliances. My guess is they keep a map on disk (all SSDs) on each node instead of keeping it in RAM. I suppose the updates (writes to the SSDs) of this map are still substantially less than the data otherwise written without deduping.

Thus I think Ceph will need a similar approach for any deduping to work, in combination with a much finer-grained "block size". The latter, I believe, is already being discussed in the context of cache tier pools - having to promote/demote 4MB blobs for a single hot 4KB of data is hardly efficient.

Regards,

Christian

> Cheers,
>
>
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: October-30-14 4:12 AM
> To: ceph-users
> Cc: Michal Kozanecki
> Subject: Re: use ZFS for OSDs
>
> On Wed, 29 Oct 2014 15:32:57 +0000 Michal Kozanecki wrote:
>
> > [snip]
> > With Ceph handling the
> > redundancy at the OSD level I saw no need for using ZFS mirroring or
> > zraid, instead if ZFS detects corruption instead of self-healing it
> > sends a read failure of the pg file to ceph, and then ceph's scrub
> > mechanisms should then repair/replace the pg file using a good
> > replica elsewhere on the cluster. ZFS + ceph are a beautiful bitrot
> > fighting match!
> >
> Could you elaborate on that?
> AFAIK Ceph currently has no way to determine which of the replicas is
> "good"; one such failed PG object will require you to do a manual
> repair after the scrub and hope that the two surviving replicas
> (assuming a size of 3) are identical. If not, start tossing a coin.
> Ideally Ceph would have a way to know what happened (as in, it's a
> checksum and not a real I/O error) and do a rebuild of that object itself.
>
> On another note, have you done any tests using the ZFS compression?
> I'm wondering what the performance impact and efficiency are.
>
> Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com