Re: Ceph + VMWare

Hi Patrick,

1) Université de Lorraine. (7,000 researchers and staff members, 60,000 students, 42 schools and education structures, 60 research labs).

2) RHCS cluster: 144 OSDs on 12 nodes for 520 TB raw capacity.
VMware clusters: 7 VMware clusters (40 ESXi hosts). The first need is to provide capacity-oriented storage (Ceph) to VMs running in a VMware vRA IaaS cluster (6 ESXi hosts).

3) Deployment growth?
RHCS cluster: Initial need was 750 TB of usable storage, and a 4x growth is expected over the next 3 years to reach 1 PB of usable storage.
VMware clusters: We just started to offer an IaaS service to research laboratories and education structures within our university. We expect to host several hundred VMs in the next 2 years (~600-800).

4) Integration method? Clearly native.
I've spent part of the last 6 months building an HA gateway cluster (iSCSI and NFS) to provide RHCS Ceph storage to our VMware IaaS cluster. Here are my findings:

    * iSCSI?

Gives better performance than NFS, we know that. BUT we cannot go into production with iSCSI because ESXi hosts enter a never-ending iSCSI 'Abort Task' loop whenever the Ceph cluster fails to acknowledge a 4 MB IO in less than 5 seconds, resulting in VMs crashing. I've been told by a VMware engineer that this 5-second limit cannot be raised, as it's hardcoded in the ESXi iSCSI software initiator. Why would an IO take more than 5 seconds? Under heavy load on the Ceph cluster, in a Ceph failure scenario (network isolation, OSD crash), when deep-scrubbing competes with client IOs, or any combination of these, or situations I haven't thought of...

    What I have tested:
iSCSI Active/Active HA cluster. Each ESXi host sees the same datastore through both targets but only accesses it through one target at a time, via a statically defined preferred path. 3 ESXi hosts work on one target, 3 on the other. If a target goes down, the remaining paths are used.
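
For the record, a minimal sketch of how a static preferred path can be pinned on an ESXi host with the Fixed path selection policy (the naa.* device ID and the vmhba path below are placeholders, not our actual values):

    # Use the Fixed PSP for the Ceph-backed device, then pin the preferred path
    esxcli storage nmp device set --device naa.6001405aaaabbbbccccdddd --psp VMW_PSP_FIXED
    esxcli storage nmp psp fixed deviceconfig set --device naa.6001405aaaabbbbccccdddd --path vmhba64:C0:T0:L0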

- LIO iSCSI targets with kernel RBD mapping (no cache). VAAI methods. Easy to configure (a configuration sketch follows this list). Delivers good performance with eager-zeroed virtual disks. The 'Abort Task' loop gets the ESXi hosts disconnected from the vCenter Server. Restarting the target brings them back in, but some VMs have certainly crashed by then.
- FreeBSD / FreeNAS running in KVM (on top of CentOS), mapping RBD images through librbd. Found that a fileio backstore was used. Found it hard to make HA with the librbd cache. And still the 'Abort Task' loop...
- SCST ESOS targets with kernel RBD mapping (no cache). VAAI methods, ALUA. Easy to configure too. 'Abort Task' still happens, but the ESXi does not get disconnected from the vCenter Server. Targets still have to be restarted to fix the situation.
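
To illustrate the LIO setup from the first bullet, a minimal sketch of mapping an RBD image with the kernel client and exporting it as a block backstore with targetcli (the image name and IQN are hypothetical):

    # Map the image with the kernel RBD client (no librbd cache involved)
    rbd map rbd/vmware-ds01                 # appears as /dev/rbd/rbd/vmware-ds01

    # Export it through LIO as a block backstore
    targetcli /backstores/block create name=vmware-ds01 dev=/dev/rbd/rbd/vmware-ds01
    targetcli /iscsi create iqn.2016-10.fr.univ-lorraine:ceph-gw1
    targetcli /iscsi/iqn.2016-10.fr.univ-lorraine:ceph-gw1/tpg1/luns create /backstores/block/vmware-ds01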

    * NFS?

Gives less performance than iSCSI, we know that too. BUT it's probably the best option right now. It's very easy to make it HA with Pacemaker/Corosync, as VMware doesn't make use of the NFS lock manager. Here is a good start: https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

We're still benchmarking IOPS to decide whether we can go into production with this infrastructure, but we're already very satisfied with the HA mechanism. Running synchronous writes on multiple VMs (on virtual disks hosted on NFS datastores backed by 'sync' exports of RBD images), while Storage vMotioning those disks between NFS RBD datastores and flapping the VIP (and thus the NFS exports) from one server to the other at the same time, never kills any VM nor makes any datastore unavailable. And every Storage vMotion task completes! These are excellent results. Note that it's important to run VMware Tools in the VMs, as the VMware Tools installation extends the write timeout on the guests' local SCSI devices.
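
For those interested, a minimal Pacemaker sketch of this kind of resource group (pcs syntax; resource names, paths and the VIP are placeholders, and it assumes the RBD image is already mapped on both nodes, e.g. with the rbdmap service):

    # Filesystem on the mapped RBD image, NFS server, then the floating VIP
    pcs resource create nfs_fs ocf:heartbeat:Filesystem \
        device=/dev/rbd/rbd/vmware-ds01 directory=/mnt/vmware-ds01 fstype=xfs
    pcs resource create nfs_daemon ocf:heartbeat:nfsserver \
        nfs_shared_infodir=/mnt/vmware-ds01/nfsinfo
    pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.0.2.100 cidr_netmask=24
    # Group them so they start in order and fail over together
    pcs resource group add nfs_group nfs_fs nfs_daemon nfs_vip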

    What I have tested:
- NFS exports in async mode, sharing RBD images with XFS on top. Gives the best performance but, obviously, no one will want to use this mode in production.
- NFS exports in sync mode, sharing RBD images with XFS on top. Gives middling performance. We would clearly announce this type of storage as capacity-oriented rather than performance-oriented through our IaaS service. As VMs cache writes, IOPS might be good enough for tier 2 or 3 applications. We could probably increase the number of IOPS by using more RBD images and NFS shares.
- NFS exports in sync mode, sharing RBD images with ZFS (with compression) on top. The idea is to provide better performance by putting the SLOG (write journal) on fast SSD drives. See this real-life (love) story: https://virtualexistenz.wordpress.com/2013/02/01/using-zfs-storage-as-vmware-nfs-datastores-a-real-life-love-story/ Each NFS server has 2 mirrored SSDs (RAID1). Each NFS server exports partitions of this SSD volume through iSCSI. Each NFS server is an initiator for both the local and the remote iSCSI targets. The SLOG device is then made of a ZFS mirror of 2 devices (as vdevs): the local iSCSI device and the remote one. (A sketch of the pool layout follows below.)
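
A hedged sketch of that pool layout (device names are placeholders: /dev/rbd0 is the mapped RBD image, sdx the local iSCSI SSD partition, sdy the remote one):

    # Pool on the RBD image, with compression enabled on the root dataset
    zpool create -O compression=lz4 vmpool /dev/rbd0
    # SLOG as a ZFS mirror of one local and one remote iSCSI SSD partition
    zpool add vmpool log mirror /dev/sdx /dev/sdy

    # /etc/exports: synchronous export of the ZFS filesystem to the ESXi subnet
    /vmpool  192.0.2.0/24(rw,sync,no_root_squash)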

So even if a whole NFS server crashes or is permanently down, the ZFS pool can still be imported on the second NFS server.
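
On the surviving node, the failover then boils down to a forced import (sketch, same placeholder pool name as above):

    # Force the import, since the dead node never exported the pool cleanly
    zpool import -f vmpool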

First benchmarks show a 4x performance improvement. Further tests will help us decide whether it's safe to go into production with this level of complexity. Still, as we're using VMware clustered datastores, it's easy to fall back to classic XFS NFS datastores by putting a ZFS datastore in maintenance mode.

As for SUSE Enterprise Storage HA iSCSI targets, I doubt they can do any better regarding the 'Abort Task' command, unless they patch the Ceph cluster to be able to abort an in-flight IO, which I doubt they could. From what I understand of how the ESXi iSCSI software initiator works, the Ceph cluster HAS to ACK an IO in less than 5 seconds. Period.

Regards,

Frederic Nass.

PS: Thank you Nick for your help regarding the 'Abort Task' loop. ;-)


On 05/10/2016 at 20:32, Patrick McGarry wrote:
Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.






