Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

Dear Ceph-users,

just to make sure nobody else makes the same mistake, I would like to share my experience with Ceph on ZFS in our test lab.
ZFS is a copy-on-write (COW) filesystem and is, IMHO, well suited where data resilience has high priority.
I work for a mid-sized datacenter in Germany and we set up a cluster running Ceph Hammer -> Infernalis -> Jewel 10.2.3 (upgrades performed during 24/7 usage).
We had initially chosen ZFS for its great cache (ARC) and thought it would be a good idea to use it instead of XFS (or EXT4, back when that was still supported).
Before that we had been using ZFS for backup-storage JBODs with good results (performance is great!).

We then assumed that ZFS would also be a good choice for distributed / high-availability scenarios.
Since the end of 2015 I have been running OpenStack Liberty / Mitaka on top of this cluster, and our use case was all sorts of VMs (20/80 split Windows / Linux).
We have been running this cluster setup for over a year now.

Details:
  • 80x disks in JBODs (56x 500GB SATA via FC, 24x 1TB SATA via SAS)
  • All nodes (OpenStack and Ceph) on CentOS 7
  • Everything on kernel 3.10.0-x, switched to 4.4.30+ (elrepo) during the upgrade to Jewel
  • ZFSonLinux latest
  • 5x Ceph nodes, each with 32GB RAM, an Intel i5, an Intel DC P3700 NVMe journal and an Emulex Fibre Channel PCIe HBA for the JBOD
  • 2x 1GBit bond per node with balance-alb (balancing via different MAC addresses during ARP) across two switches
  • 2x HP 2920 using 20G interconnect, then switched to 2x HP Comware 5130 using IRF-stack with 20G interconnect
  • Nodes had a RAIDZ2 (RAID6) configuration over 14x 500GB disks (= 1 OSD per node); the 24-disk JBOD had 4x RAIDZ2 (RAID6) of 6 disks each (= 4 OSDs, only 2 in production). A rough sketch of one such OSD layout follows after this list.
  • 90x VMs in total at the time we ended our evaluation
  • 6 OSDs in total
  • pg_num 128 x 4 pools, 512 PGs total, size 2 and min_size 1
  • OSDs filled 30 - 40%, low fragmentation
We were not using 10GBit NICs because our VM traffic would not exceed 2x 1GBit per node in normal operation; we expected a lot of 4k blocks from Windows remote desktop services (known as "terminal servers").
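
For the curious, here is a minimal sketch of how one such RAIDZ2-backed OSD could be laid out. Device names, the pool/mount names, the OSD id and the pool name "volumes" are placeholders, not our exact commands:

  # one RAIDZ2 vdev as backing store for a single filestore OSD
  zpool create -o ashift=12 tank raidz2 sdb sdc sdd sde sdf sdg
  zfs set xattr=sa tank
  zfs set mountpoint=/var/lib/ceph/osd/ceph-0 tank

  # register the OSD on top of the ZFS filesystem (Jewel, filestore)
  ceph osd create                       # returns the new id, assumed to be 0 here
  ceph-osd -i 0 --mkfs --mkkey
  ceph auth add osd.0 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-0/keyring
  ceph osd crush add osd.0 1.0 host=node1

  # pools as listed above: pg_num 128, size 2, min_size 1
  ceph osd pool create volumes 128 128
  ceph osd pool set volumes size 2
  ceph osd pool set volumes min_size 1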

Pros:
  • Survived two outages without a single lost object (we just had to run "ceph pg repair <pg-id>" on 4 PGs; see the short repair sketch after this list)
    KVM VMs froze and the guest OS kept resetting the SCSI bus until the cluster was back online - no broken databases (we were running MySQL, MSSQL and Exchange)
  • Read-Cache using normal Samsung PRO SSDs works very well
  • Together with multipathd, optimal redundancy and performance
  • Deep-scrub is not needed, as ZFS scrubs itself in RAIDZ1 and RAIDZ2, backed by checksums
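
Regarding the PG repairs mentioned above: finding and repairing inconsistent PGs is straightforward; a short sketch (the PG id is only an example):

  ceph health detail | grep inconsistent
  ceph pg repair 3.1a
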
Cons:
  • Performance degrades more and more with ongoing usage (we added the 1TB-disk JBOD to mitigate this issue) but lately we hit it again.
  • Disks run at 100% utilization all the time in the 14x 500GB JBODs, around 30% in the SAS JBOD - mostly related to COW
  • Even a little bit of fragmentation results in slowdowns
  • If deep-scrub is enabled, I/O gets stuck very often
  • The noout flag needs to be set to stop recovery storms (which is bad: recovering a single 500GB OSD would be quick, while recovering a 6TB OSD takes a very long time). A sketch of the flag handling follows below.
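
Regarding the noout flag: a minimal sketch of the flag handling around planned maintenance on one node (the OSD id is a placeholder):

  ceph osd set noout
  systemctl stop ceph-osd@0      # do the maintenance work on this OSD / node
  systemctl start ceph-osd@0
  ceph osd unset noout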

We moved from Hammer in 2015 to Infernalis in early 2016 and to Jewel in October 2016. During the upgrade to Jewel, we moved to the elrepo.org kernel-lt package and went from kernel 3.10.0 to 4.4.30+.
The migration from Infernalis to Jewel was noticeable: most VMs ran a lot faster, but we also saw a large increase in stuck requests. I am not sure, but I did not notice any on Infernalis.

We experienced a lot of blocked I/O ("X requests are blocked > 32 sec") when a lot of data was changed in cloned RBDs (images imported via OpenStack Glance, cloned during instance creation by Cinder).
If the disk had been cloned some months earlier and large software updates were applied (a lot of small files) combined with a lot of syncs, we often had a node hit its suicide timeout.
Most likely this is a problem with the op thread count, as it is easy to block threads on RAIDZ2 (RAID6) when many small operations are written to disk (again, COW is not optimal here).
When recovery took place (0.020% degraded), cluster performance was very bad - remote desktop VMs (Windows) were unusable. Recovery itself ran at 70 - 200 MB/s, which was okay.
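
For reference, these are the kind of knobs involved when an OSD hits its suicide timeout. This is only a hedged sketch of a ceph.conf fragment with example values (not settings we verified as a fix); something like "ceph daemon osd.<id> dump_ops_in_flight" on the affected node shows which requests are stuck:

  [osd]
  osd op threads = 4                     # default 2 on Jewel
  osd op thread timeout = 15             # default 15 seconds
  osd op thread suicide timeout = 300    # default 150 seconds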

Reads did not cause any problems. We made a lot of backups of running VMs during the day and performance in other VMs was only slightly lowered - nothing we really worried about.
All in all, read performance was okay while write performance was awful as soon as the filestore flush kicked in (= after a few seconds when downloading data via GBit into a VM).
Scrub and deep-scrub needed to be disabled to maintain "normal operation" - this is the worst point about this setup.
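
"Disabling" scrubbing here simply means setting the cluster-wide flags, roughly:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # and to turn it back on:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub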

In terms of data resilience we were very satisfied. Before the upgrade to Jewel we had one node crashing regularly on Infernalis (we never found the reason after 3 days of searching), and no data was corrupted when this happened (especially MS Exchange did not complain!).
After we upgraded to Jewel, it did not crash again. In all cases, VMs remained fully functional.

Currently we are migrating most VMs out of the cluster to shut it down (we had some semi-productive VMs on it to get real world usage stats).

I just wanted to let you know which problems we had with Ceph on ZFS. No doubt we made a lot of mistakes (this was our first Ceph cluster), but we ran a lot of tests on it and would not recommend using ZFS as the backend.

And for those interested in monitoring this type of cluster: do not use Munin. As the disks were spinning at 100% and each disk is seen three times (two paths combined into one mpath device), I caused a deadlock resulting in 3/4 offline nodes (one of the disasters where we had Ceph repair everything).
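
If you do monitor such a setup, make sure the monitoring only looks at the multipath devices and not at the individual paths. A rough sketch for telling them apart (device names are examples):

  multipath -ll                  # lists each mpath device with its path members
  ls /sys/block/dm-0/slaves      # e.g. "sdc sdq" - these are paths, not separate disks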

I hope this helps all Ceph users who are interested in the idea of running Ceph on ZFS.

Kind regards,
Kevin Olbrich.
