Hello Andrei,

I'm trying to accomplish the same thing with VMware. So far I'm still doing lab testing, but we've gotten as far as simulating a production workload. Forgive the lengthy reply; I happen to be sitting on an airplane.

My existing solution uses NFS servers running in ESXi VMs. Each VM serves one or two large (2-4 TB) rbd images, which hold vmdk storage as well as Oracle RAC disks. I tested having multiple NFS servers serve a single rbd, but kept seeing xfs corruption (recoverable with xfs_repair). I initially blamed ceph, but eventually realized the problem was with xfs, or rather with my configuration: it is generally a very bad idea to write to the same xfs file system from two separate computers, whether it lives on a ceph rbd or on a physical disk in a shared disk array. You would need some way to synchronize writes between the servers mounting the rbd; protocols exist for that, but all of them would add more latency, which I'm already struggling to control.

My environment is all Cisco UCS hardware: C240 rack mount servers for the OSDs and B200 blade servers for VMware ESXi. The entire network is 10Gb or better.

After carefully examining my NFS servers (VMs running in ESXi on local storage), I found a tremendous amount of kernel IO, caused by the high volume of TCP packets each server had to constantly process for both the NFS traffic and the ceph traffic. One thing that helped was enabling jumbo frames on every device in the path from ESXi to the OSDs. That is not as simple as it sounds: in ESXi, the vmk port and the vSwitch it sits on must have the MTU set to 9000; in the switches, the VLANs and the interfaces need an MTU of 9128 (don't forget the VLAN tagging overhead); in UCSM (the Cisco GUI for configuring the blades and networking), all the vNICs and QoS policies must be set to 9000; and the Linux interfaces in the NFS servers, mons, and OSDs all need to be set to 9000 as well. My kernel IO was still high after that, so I gave the NFS VM more resources (8 vCPUs, 8 GB RAM), which also helped.

With all of that in place, my lab environment sustains 200 iops, bursting up to 500 iops (from VMware's perspective), on one NFS server VM. The IO is mostly small writes. My lab cluster has just 11 OSDs in a single node, and with 3x replication the cluster is actually doing more like 600-1400 iops. The OSD node has an LSI 2208 controller (2 GB cache) with each disk in its own single-disk RAID1 virtual drive (necessary to take advantage of the write-back cache). The OSDs have no separate journal, which means the disks are actually writing at 1200-2800 iops (journal + data). Not bad for one node with 11x 7k disks.

I still have high latency, though it is much better than before enabling jumbo frames: VMware shows between 10,000 and 200,000 microseconds. That is acceptable for this application. The IO is mostly asynchronous (alarming/logging writes, database updates), and I don't notice the latency on the VMs running in the ceph-backed NFS datastore. I believe the latency actually comes from the single OSD node being pretty much maxed out; I have 4 more OSD servers on order to hopefully smooth out the latency spikes.

One huge problem with the NFS server gateway approach is the number of file system layers introduced in each OS.
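For anyone who wants to try the same thing, one of my NFS gateway VMs is wired up roughly like the sketch below. The interface name, pool/image name, mount point, and export subnet are made-up examples rather than my real values, and it assumes ceph.conf and the client keyring are already on the gateway:

    # Jumbo frames on the interface that carries both the NFS and the ceph traffic
    ip link set dev eth0 mtu 9000

    # Map the rbd image with the kernel client and put XFS on it
    rbd map nfspool/esx-datastore-01
    mkfs.xfs /dev/rbd/nfspool/esx-datastore-01    # first time only; the device may show up as /dev/rbd0
    mkdir -p /exports/esx-datastore-01
    mount /dev/rbd/nfspool/esx-datastore-01 /exports/esx-datastore-01

    # Export it to the ESXi vmk subnet. Only ONE gateway should export a given
    # rbd/XFS at a time; mounting the same XFS from two servers is what corrupted mine.
    echo '/exports/esx-datastore-01 10.0.0.0/24(rw,no_root_squash,sync)' >> /etc/exports
    exportfs -ra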
My current solution's file system stack looks like this:

    ext4  - VM file systems
    VMFS  - ESXi
    NFS   - between ESXi and the NFS server
    XFS   - NFS server to the mounted rbd disk
    Rados - NFS server's ceph kernel client to the OSDs
    XFS   - OSDs to their local file system

Yuck! Four journaling file systems to write through: VMFS, XFS, the OSD journal, and XFS again. Clearly the best approach would be for the VMs to access the ceph cluster directly:

    ext4  - VM file systems
    Rados - VM's ceph kernel client to the OSDs
    XFS   - OSDs to their local file system

Due to the packaging/deployment procedure of my application (and the ancient RHEL 5 kernel), that won't be possible any time soon; the application will be migrated off of VMware onto OpenStack first.

Since I'm using UCS hardware, there is native FCoE built in (with FC frame offload, and I can even boot from FCoE), so I am going to build a pair of Fibre Channel gateways to replace the NFS servers. The file system stack will then look like this:

    ext4  - VM file systems
    VMFS  - ESXi
    FC    - between the UCS vHBA and the FC target
    Rados - FC target via LIO, ceph kernel client to the OSDs
    XFS   - OSDs to their local file system

I had some issues getting a B200 blade to work in FC target mode (it was only designed to be an initiator), so I'll have to use a C240 in independent mode connected to a Nexus 5k switch.

As an alternative (while I wait for my new OSD nodes and Nexus switches to arrive), I was interested in trying tgt with FCoE. I've seen some negative performance reports attributed to its userland ceph client versus the kernel client, and, more importantly, I haven't heard of anyone using FCoE with it, only iSCSI. I may try iSCSI anyway, just to see whether it performs better than my NFS solution.

The key thing to remember is that you only want the VMFS file system itself to be shared; VMFS is designed to be used by multiple clients simultaneously, and from what I can gather there is no issue sharing raw rbd disks between multiple ceph clients. Note that you do want to disable caching if you use a user-space rbd client.

I'm not sure whether FC or FCoE is an option for you, Andrei. FC has been allowing shared disk access from multiple servers to redundant target machines for a very long time, and depending on your network, FCoE may be a good fit. Many production ESXi hosts use FC for storage access, so it may simply be a matter of installing a few FC HBAs in a couple of modest Linux servers and connecting them into your existing FC network.

Jake

On Monday, May 12, 2014, Andrei Mikhailovsky <andrei at arhont.com> wrote:

> Hello guys,
>
> I am currently running a ceph cluster for running vms with qemu + rbd. It
> works pretty well and provides a good degree of failover. I am able to run
> maintenance tasks on the ceph nodes without interrupting vms IO.
>
> I would like to do the same with VMWare / XenServer hypervisors, but I am
> not really sure how to achieve this. Initially I thought of using iscsi
> multipathing, however, as it turns out, multipathing is more for load
> balancing and nic/switch failure. It does not allow me to perform
> maintenance on the iscsi target without interrupting service to vms.
>
> Has anyone done either a PoC or better a production environment where
> they've used ceph as a backend storage with vmware / xenserver? The
> important element for me is to have the ability of performing maintenance
> tasks and resilience to failovers without interrupting IO to vms.
>
> Many thanks
>
> Andrei