Re: Ceph + VMWare

Hi Patrick,

1) Université de Lorraine. (7,000 researchers and staff members, 60,000 students, 42 schools and education structures, 60 research labs).

2) RHCS cluster: 144 OSDs on 12 nodes for 520 TB raw capacity.
VMware clusters: 7 VMware clusters (40 ESXi hosts). The first need is to provide capacity-oriented storage (Ceph) to VMs running in a VMware vRA IaaS cluster (6 ESXi hosts).

3) Deployment growth?
RHCS cluster: Initial need was 750 TB of usable storage, and a 4x growth is expected over the next 3 years to reach 1 PB of usable storage.
VMware clusters: We just started to offer an IaaS service to research laboratories and education structures within our university. We expect to host several hundred VMs in the next 2 years (~600-800).

4) Integration method? Clearly native.
I've spent part of the last 6 months building an HA gateway cluster (iSCSI and NFS) to provide RHCS Ceph storage to our VMware IaaS cluster. Here are my findings:

    * iSCSI?

Gives better performance than NFS, we know that. BUT we cannot go into production with iSCSI because ESXi hosts enter a never-ending iSCSI 'Abort Task' loop whenever the Ceph cluster fails to acknowledge a 4 MB IO in less than 5 seconds, resulting in VMs crashing. I've been told by a VMware engineer that this 5-second limit cannot be raised, as it's hardcoded in the ESXi iSCSI software initiator. Why would an IO take more than 5 seconds? Under heavy load on the Ceph cluster, in a Ceph failure scenario (network isolation, OSD crash), when deep-scrubbing competes with client IOs, or any combination of these, or situations I haven't thought of...

    What I have tested:
iSCSI Active/Active HA cluster. Each ESXi host sees the same datastore through both targets but only accesses it through one target at a time, via a statically defined preferred path. 3 ESXi hosts work on one target, 3 on the other. If a target goes down, the remaining paths are used.
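
For the record, a minimal sketch of how a static preferred path can be pinned on an ESXi host with the Fixed path selection policy (the naa.* device ID and the vmhba path below are placeholders, not our actual values):

    # Use the Fixed PSP for the Ceph-backed device, then pin the preferred path
    esxcli storage nmp device set --device naa.6001405aaaabbbbccccdddd --psp VMW_PSP_FIXED
    esxcli storage nmp psp fixed deviceconfig set --device naa.6001405aaaabbbbccccdddd --path vmhba64:C0:T0:L0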

- LIO iSCSI targets with kernel RBD mapping (no cache). VAAI methods. Easy to configure (a configuration sketch follows this list). Delivers good performance with eager-zeroed virtual disks. The 'Abort Task' loop gets the ESXi hosts disconnected from the vCenter Server. Restarting the target brings them back in, but some VMs have certainly crashed by then.
- FreeBSD / FreeNAS running in KVM (on top of CentOS), mapping RBD images through librbd. Found that a fileio backstore was used. Found it hard to make HA with the librbd cache. And still the 'Abort Task' loop...
- SCST ESOS targets with kernel RBD mapping (no cache). VAAI methods, ALUA. Easy to configure too. 'Abort Task' still happens, but the ESXi does not get disconnected from the vCenter Server. Targets still have to be restarted to fix the situation.
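
To illustrate the LIO setup from the first bullet, a minimal sketch of mapping an RBD image with the kernel client and exporting it as a block backstore with targetcli (the image name and IQN are hypothetical):

    # Map the image with the kernel RBD client (no librbd cache involved)
    rbd map rbd/vmware-ds01                 # appears as /dev/rbd/rbd/vmware-ds01

    # Export it through LIO as a block backstore
    targetcli /backstores/block create name=vmware-ds01 dev=/dev/rbd/rbd/vmware-ds01
    targetcli /iscsi create iqn.2016-10.fr.univ-lorraine:ceph-gw1
    targetcli /iscsi/iqn.2016-10.fr.univ-lorraine:ceph-gw1/tpg1/luns create /backstores/block/vmware-ds01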

    * NFS?

Gives less performance than iSCSI, we know that too. BUT it's probably the best option right now. It's very easy to make it HA with Pacemaker/Corosync, as VMware doesn't make use of the NFS lock manager. Here is a good start: https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/

We're still benchmarking IOPS to decide whether we can go into production with this infrastructure, but we're already very satisfied with the HA mechanism. Running synchronous writes on multiple VMs (on virtual disks hosted on NFS datastores backed by 'sync' exports of RBD images), while Storage vMotioning those disks between NFS RBD datastores and flapping the VIP (and thus the NFS exports) from one server to the other at the same time, never kills any VM nor makes any datastore unavailable. And every Storage vMotion task completes! These are excellent results. Note that it's important to run VMware Tools in the VMs, as the VMware Tools installation extends the write timeout on the guests' local SCSI devices.
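
For those interested, a minimal Pacemaker sketch of this kind of resource group (pcs syntax; resource names, paths and the VIP are placeholders, and it assumes the RBD image is already mapped on both nodes, e.g. with the rbdmap service):

    # Filesystem on the mapped RBD image, NFS server, then the floating VIP
    pcs resource create nfs_fs ocf:heartbeat:Filesystem \
        device=/dev/rbd/rbd/vmware-ds01 directory=/mnt/vmware-ds01 fstype=xfs
    pcs resource create nfs_daemon ocf:heartbeat:nfsserver \
        nfs_shared_infodir=/mnt/vmware-ds01/nfsinfo
    pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.0.2.100 cidr_netmask=24
    # Group them so they start in order and fail over together
    pcs resource group add nfs_group nfs_fs nfs_daemon nfs_vip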

    What I have tested:
- NFS exports in async mode, sharing RBD images with XFS on top. Gives the best performance but, obviously, no one will want to use this mode in production.
- NFS exports in sync mode, sharing RBD images with XFS on top. Gives middling performance. We would clearly announce this type of storage as capacity-oriented rather than performance-oriented through our IaaS service. As VMs cache writes, IOPS might be good enough for tier 2 or 3 applications. We could probably increase the number of IOPS by using more RBD images and NFS shares.
- NFS exports in sync mode, sharing RBD images with ZFS (with compression) on top. The idea is to provide better performance by putting the SLOG (write journal) on fast SSD drives. See this real-life (love) story: https://virtualexistenz.wordpress.com/2013/02/01/using-zfs-storage-as-vmware-nfs-datastores-a-real-life-love-story/ Each NFS server has 2 mirrored SSDs (RAID1). Each NFS server exports partitions of this SSD volume through iSCSI. Each NFS server is an initiator for both the local and the remote iSCSI targets. The SLOG device is then made of a ZFS mirror of 2 devices (as vdevs): the local iSCSI device and the remote one. (A sketch of the pool layout follows below.)
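
A hedged sketch of that pool layout (device names are placeholders: /dev/rbd0 is the mapped RBD image, sdx the local iSCSI SSD partition, sdy the remote one):

    # Pool on the RBD image, with compression enabled on the root dataset
    zpool create -O compression=lz4 vmpool /dev/rbd0
    # SLOG as a ZFS mirror of one local and one remote iSCSI SSD partition
    zpool add vmpool log mirror /dev/sdx /dev/sdy

    # /etc/exports: synchronous export of the ZFS filesystem to the ESXi subnet
    /vmpool  192.0.2.0/24(rw,sync,no_root_squash)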

So even if a whole NFS server crashes or is permanently down, the ZFS pool can still be imported on the second NFS server.
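
On the surviving node, the failover then boils down to a forced import (sketch, same placeholder pool name as above):

    # Force the import, since the dead node never exported the pool cleanly
    zpool import -f vmpool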

First benchmarks show a 4x performance improvement. Further tests will help us decide whether it's safe to go into production with this level of complexity. Still, as we're using VMware clustered datastores, it's easy to fall back to classic XFS NFS datastores by putting a ZFS datastore in maintenance mode.

As for SUSE Enterprise Storage HA iSCSI targets, I doubt they can do any better regarding the 'Abort Task' command, unless they patch the Ceph cluster to be able to abort an in-flight IO, which I doubt they could. From what I understand of how the ESXi iSCSI software initiator works, the Ceph cluster HAS to ACK an IO in less than 5 seconds. Period.

Regards,

Frederic Nass.

PS: Thank you Nick for your help regarding the 'Abort Task' loop. ;-)


On 05/10/2016 at 20:32, Patrick McGarry wrote:
Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.






