Ceph with VMWare / XenServer

Hello Andrei,

I'm trying to accomplish the same thing with VMWare. So far I'm still doing
lab testing, but we've gotten as far as simulating a production workload.
Forgive the lengthy reply; I happen to be sitting on an airplane.

My existing solution uses NFS servers running in ESXi VMs. Each VM serves
one or two large (2-4 TB) rbd images. These images are used for vmdk
storage as well as Oracle RAC disks.
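
For anyone curious, each gateway is built roughly like this (the pool/image
names and sizes are illustrative, not my actual config):

  rbd create nfsgw/datastore01 --size 4194304   # size in MB, ~4 TB
  rbd map nfsgw/datastore01                     # kernel client, appears as /dev/rbd0
  mkfs.xfs /dev/rbd0
  mkdir -p /export/datastore01
  mount /dev/rbd0 /export/datastore01
  echo '/export/datastore01 *(rw,sync,no_root_squash)' >> /etc/exports
  exportfs -ra

ESXi then mounts that export as an NFS datastore and the vmdks live on it.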

I tested using multiple NFS servers serving a single rbd, but kept seeing
xfs corruption (which was recoverable with xfs_repair). I initially blamed
ceph, but eventually realized the problem was with xfs; or rather, with my
configuration. It is generally a very bad idea to write to the same xfs
file system from two separate computers, whether it lives on a ceph rbd or
on a physical disk in a shared disk array. XFS is not a clustered file
system, so what would be required is a way to synchronize writes between
the servers mounting the rbd. There are protocols that do this, but all of
them would introduce more latency, which I'm already struggling to control.

My environment is all Cisco UCS hardware: C240 rack-mount servers for OSDs
and B200 blade servers for VMWare ESXi. The entire network is 10Gb or
better.  After carefully examining my nfs servers (which are VMs running in
ESXi on local storage), I found a tremendous amount of CPU time being spent
in the kernel. This was due to the high volume of TCP packets they had to
constantly process for BOTH the NFS traffic and the ceph traffic.
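
If you want to look for the same symptom, the standard sysstat tools make
it obvious; nothing ceph-specific is needed:

  mpstat -P ALL 5   # watch %sys and %soft climb under load
  sar -n DEV 5      # packets/s per interface (NFS in, ceph out)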

One thing that helped was to enable jumbo frames on every device in the
path from ESXi to the OSDs. This is not as simple as it sounds. In ESXi,
the vmk port and the vSwitch the vmk is on must have the mtu set to 9000.
In the switches, the VLANs and the interfaces need to have the mtu set to
9128 (don't forget about vlan tagging overhead). In the UCSM (Cisco GUI for
configuring the Blades and networking), all the vnics and the qos policies
must be set to 9000. The Linux interfaces in the nfs servers, mons, and
osds all needed to be set to 9000 as well.
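
For reference, the Linux and ESXi ends of that look roughly like this (the
interface, vSwitch, and vmk names are just examples):

  # Linux (nfs servers, mons, osds); persist with MTU=9000 in ifcfg-eth0 on RHEL
  ip link set dev eth0 mtu 9000
  # verify the whole path passes jumbo frames without fragmenting
  ping -M do -s 8972 <osd-ip>   # 8972 = 9000 - 20 (IP) - 8 (ICMP)

  # ESXi: both the vSwitch and the vmk port need it
  esxcli network vswitch standard set --vswitch-name vSwitch1 --mtu 9000
  esxcli network ip interface set --interface-name vmk1 --mtu 9000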

Kernel CPU usage was still high, so I gave the NFS VM more resources (8
vCPUs, 8 GB RAM).  This helped as well.

With all that in place, my lab environment is doing a sustained 200 iops,
bursting up to 500 iops (from VMWare's perspective), on one NFS server VM.
The IO is mostly small writes. My lab cluster has just 11 osds in a single
node.  I have 3x replication as well, so the cluster is actually doing more
like 600 - 1400 iops. The osd node has an LSI 2208 controller (2GB cache)
with each disk in a separate single-disk RAID1 virtual drive (necessary to
take advantage of the write-back cache). The OSDs have no separate journal,
which means the disks are actually writing at 1200 - 2800 iops (journal +
data). Not bad for one node with 11x 7k disks.

I still have high latency (though it is much better than before enabling
jumbo frames). VMWare shows between 10,000 and 200,000 microseconds
(10 - 200 ms) of latency.  That is acceptable for this application.  IO is
mostly asynchronous: alarming/logging writes and database updates. I don't
notice the latency on the VMs running in the ceph-backed NFS datastore.

I believe the latency is actually from the osd node being pretty much maxed
out. I have 4 more osd servers on order to hopefully smooth out the latency
spikes.


One huge problem with the NFS server gateway approach is the number of file
system layers introduced along the way. My current solution's file system
stack looks like this:

ext4 - VMs file systems
VMFS - ESXi
NFS - between ESXi and nfs server
XFS - NFS server to mounted rbd disk
Rados - NFS server ceph kernel client to OSDs
XFS - OSDs to local file system

Yuck!  Four journaling layers to write through: VMFS, XFS on the gateway,
the OSD journal, and XFS on the OSDs.


Clearly the best approach would be for the VMs to directly access the ceph
cluster:

ext4 - VMs file systems
Rados - VM ceph kernel client to OSDs
XFS - OSDs to local file system

Due to the packaging/deployment procedure of my application (and the
ancient RHEL 5 kernel), that won't be possible any time soon. The
application will be migrated off of VMWare to OpenStack first.

Since I'm using UCS hardware, there is native FCoE built in (with FC frame
offload, and I can even boot off of FCoE), so I am going to build a pair of
fibre channel gateways to replace the NFS servers. The filesystem stack
will then look like this:

ext4 - VMs file systems
VMFS - ESXi
FC - between UCS vHBA and FC Target
Rados - FC target via LIO, ceph kernel client to OSDs
XFS - OSDs to local file system
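
The gateway side of that would be built with targetcli, roughly like the
below (the WWN and names are placeholders, and it assumes a QLogic HBA
driven by the tcm_qla2xxx fabric module in target mode):

  rbd map fcgw/lun0                                   # kernel rbd -> /dev/rbd0
  targetcli /backstores/block create name=lun0 dev=/dev/rbd0
  targetcli /qla2xxx create naa.21000024ff000001      # WWN of the HBA port
  targetcli /qla2xxx/naa.21000024ff000001/luns create /backstores/block/lun0
  # plus ACLs for each ESXi initiator WWN
  targetcli saveconfig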

I had some issues with getting a B200 blade to work in FC target mode (it
was only designed to be an initiator), so I'll have to use a C240 in
independent mode connected to a nexus 5k switch.

As an alternative (while I wait for my new osd nodes and Nexus switches to
arrive), I was interested in trying tgt with FCoE. I've seen some negative
performance reports due to tgt using the userland ceph client (librbd)
rather than the kernel client. More importantly, I haven't heard of anyone
using FCoE with it, only iscsi. I may try iscsi anyway, just to see if it
performs better than my nfs solution.
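
If anyone wants to try the tgt route, the rbd backing store (assuming your
tgt build was compiled with it) is exported roughly like this (target and
image names made up):

  tgtadm --lld iscsi --mode target --op new --tid 1 \
         --targetname iqn.2014-05.lab.example:rbd.lun0
  tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
         --bstype rbd --backing-store iscsigw/lun0
  tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL

This path goes through librbd rather than the kernel client, which is where
the performance concern comes from.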

The key thing to remember is that you only want the VMFS filesystem itself
to be shared; it is designed to be used by multiple clients simultaneously.
From what I can gather, there is no issue sharing raw rbd disks between
multiple ceph clients. Note that you do want to disable caching if you use
a user-space rbd client.
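
Disabling it is just a ceph.conf setting on the gateway; the kernel client
doesn't use librbd's cache, so this only matters for user-space clients
like tgt's:

  [client]
  rbd cache = false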

I'm not sure if FC or FCoE is an option for you, Andrei. FC has been
providing shared disk access from multiple servers to redundant target
machines for a very long time, and depending on your network, FCoE may be a
good fit.  Many production ESXi hosts use FC for storage access, so it may
simply be a matter of installing a few FC HBAs in a couple of modest Linux
servers and connecting them into your existing FC network.

Jake

On Monday, May 12, 2014, Andrei Mikhailovsky <andrei at arhont.com> wrote:

> Hello guys,
>
> I am currently running a ceph cluster for running vms with qemu + rbd. It
> works pretty well and provides a good degree of failover. I am able to run
> maintenance tasks on the ceph nodes without interrupting vms IO.
>
> I would like to do the same with VMWare / XenServer hypervisors, but I am
> not really sure how to achieve this. Initially I thought of using iscsi
> multipathing, however, as it turns out, multipathing is more for load
> balancing and nic/switch failure. It does not allow me to perform
> maintenance on the iscsi target without interrupting service to vms.
>
> Has anyone done either a PoC or better a production environment where
> they've used ceph as a backend storage with vmware / xenserver? The
> important element for me is to have the ability of performing maintenance
> tasks and resilience to failovers without interrupting IO to vms.  Are
> there any recommendations or howtos on how this could be achieved?
>
> Many thanks
>
> Andrei
>
>