This is for us peeps using Ceph with VMware.
My current favoured solution for consuming Ceph in VMware is via RBDs formatted with XFS and exported via NFS to ESXi. This seems to perform better than iSCSI+VMFS, which doesn’t play nicely with Ceph’s PG contention issues, particularly when working with thin-provisioned VMDKs.
I’ve still been noticing some performance issues, however, mainly when doing any form of storage migration. This is largely due to the way vSphere transfers VMs in 64KB IOs at a queue depth (QD) of 32. vSphere does this so arrays with QoS can balance the IO more easily than if larger IOs were submitted. However, Ceph’s PG locking means that only one or two of these IOs can happen at a time, seriously lowering throughput. Typically you won’t be able to push more than 20-25MB/s during a storage migration.
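As a rough sanity check on that figure, here’s a back-of-the-envelope model. The 3ms per-IO write latency is purely an assumed number for illustration; the point is that throughput is dictated by how many IOs can actually make progress in parallel, not by the QD32 that vSphere submits.

```python
# Back-of-the-envelope model of storage migration throughput when Ceph's
# PG locking serialises the 64KB IOs that vSphere submits at QD32.
# The 3ms write latency is an assumption for illustration, not a measurement.

IO_SIZE = 64 * 1024      # bytes per IO during a vSphere storage migration
QUEUE_DEPTH = 32         # what vSphere keeps in flight
WRITE_LATENCY = 0.003    # assumed per-IO write latency in seconds

def throughput_mb_s(effective_concurrency):
    """Throughput if only this many IOs actually progress at once."""
    iops = effective_concurrency / WRITE_LATENCY
    return iops * IO_SIZE / 1e6

print(throughput_mb_s(1))            # ~22 MB/s - one IO at a time
print(throughput_mb_s(2))            # ~44 MB/s - two IOs at a time
print(throughput_mb_s(QUEUE_DEPTH))  # ~700 MB/s if all 32 ran in parallel
```

With only one or two IOs effectively progressing, the model lands in the same ballpark as the 20-25MB/s above.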
There is another issue in that the IO needed for the XFS journal on the RBD can cause contention, and it effectively means every NFS write IO sends two IOs down to Ceph. This can have an impact on latency as well. Due to the possible PG contention caused by the XFS journal updates when multiple IOs are in flight, you normally end up creating more and more RBDs to try and spread the load. This normally means you end up having to do storage migrations... you can see where I’m going with this.
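To put the journal effect into numbers, here’s a tiny sketch of that write amplification; the front-end rate is an assumed figure just to make the arithmetic concrete.

```python
# Sketch of the write amplification from layering XFS on an RBD: as
# described above, each NFS write effectively sends two writes to Ceph
# (the data itself plus the XFS journal update).
# The front-end rate below is an assumed figure, purely for illustration.

frontend_write_iops = 1000            # assumed NFS writes/s arriving from ESXi

data_writes = frontend_write_iops     # VMDK data hitting the RBD
journal_writes = frontend_write_iops  # matching XFS log updates

backend_iops = data_writes + journal_writes
print(backend_iops)                   # ~2000 writes/s landing on RADOS

# The XFS log also lives in a small, fixed region of the RBD, so those
# journal writes keep hitting the same few objects/PGs - the worst case
# for Ceph's per-PG locking when lots of IOs are in flight.
```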
I’ve been thinking for a while that
CephFS works around a lot of these limitations.
1. It supports fancy striping, so there should be less per-object contention (see the striping sketch after this list)
2. There is no FS in the middle to maintain a journal and other associated IO
3. A single large NFS mount should have none of the disadvantages seen with a single RBD
4. No need to migrate VMs about because of #3
5. No need to fstrim after deleting VMs
6. Potential to do away with Pacemaker and use LVS to do active/active NFS, as ESXi does its own locking with files
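To illustrate point 1: CephFS file layouts let you stripe a file across many RADOS objects with a small stripe unit, so consecutive 64KB writes land on different objects (and usually different PGs) instead of queueing behind one object’s lock. Here’s a minimal sketch of the standard Ceph striping arithmetic (stripe_unit / stripe_count / object_size); the layout values are just example numbers, not a recommendation.

```python
# Map a file offset to a RADOS object under Ceph's striping parameters.
# The layout values below are illustrative examples, not tuning advice.

STRIPE_UNIT = 64 * 1024        # bytes written to one object before moving on
STRIPE_COUNT = 8               # objects striped across in each object set
OBJECT_SIZE = 4 * 1024 * 1024  # maximum size of each RADOS object

def object_for_offset(offset):
    """Return the object number a byte offset maps to under this layout."""
    stripes_per_object = OBJECT_SIZE // STRIPE_UNIT
    block = offset // STRIPE_UNIT          # which stripe unit overall
    stripe_no = block // STRIPE_COUNT      # which stripe (row)
    stripe_pos = block % STRIPE_COUNT      # which column in that stripe
    object_set = stripe_no // stripes_per_object
    return object_set * STRIPE_COUNT + stripe_pos

# Eight consecutive 64KB writes from a storage migration:
offsets = [i * 64 * 1024 for i in range(8)]
print([object_for_offset(o) for o in offsets])   # [0, 1, 2, 3, 4, 5, 6, 7]
# With a default-style layout (stripe_count=1) they would all hit object 0
# until 4MB had been written, serialising behind a single PG lock.
```

In CephFS the layout is set per file or directory via the ceph.file.layout / ceph.dir.layout extended attributes.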
With this in mind, I exported a CephFS mount via NFS and then mounted it on an ESXi host as a test. Initial results are looking very good. I’m seeing storage migrations to the NFS mount going at over 200MB/s, which equates to several thousand IOs a second (around 3,000 at 64KB each) and seems to be writing at the intended QD32.
I need to do more testing to make sure
everything works as intended, but like I say, promising
initial results.
Further testing needs to be done to see what sort of MDS performance is required; I would imagine that since we are mainly dealing with large files, it might not be that critical. I also need to consider the stability of CephFS: RBD is relatively simple and is in use by a large proportion of the Ceph community, whereas CephFS is a lot easier to “upset”.
Nick