Hi Patrick,
1) Université de Lorraine (7,000 researchers and staff members, 60,000
students, 42 schools and education structures, 60 research labs).
2) RHCS cluster: 144 OSDs on 12 nodes for 520 TB raw capacity.
VMware clusters: 7 VMware clusters (40 ESXi hosts). First need is
to provide capacitive storage (Ceph) to VMs running in a VMware vRA IaaS
cluster (6 ESXi hosts).
3) Deployment growth ?
RHCS cluster: Initial need was 750 TB of usable storage, so a x4
growth in the next 3 years is expected to reach 1 PB of usable storage.
VMware clusters: We just started to offer an IaaS service to
research laboratories and education structures within our university.
We can expect to host several hundreds of VMs in the next 2 years
(~600-800).
4) Integration method ? Clearly native.
I spent some of the last 6 months working on building an HA gateway
cluster (iSCSI and NFS) to provide RHCS Ceph storage to our VMware IaaS
Cluster. Here are my findings:
* iSCSI ?
Gives better performance than NFS, we know that. BUT, we cannot go
into production with iSCSI because ESXi hosts enter a never-ending
iSCSI 'Abort Task' loop when the Ceph cluster fails to acknowledge a 4MB
IO in less than 5s, resulting in VMs crashing. I've been told by a
VMware engineer that this 5s limit cannot be raised as it's hardcoded in
the ESXi iSCSI software initiator.
Why would an IO take more than 5s? A heavy load on the Ceph cluster,
a Ceph failure scenario (network isolation, OSD crash), deep-scrubbing
interfering with client IOs, any combination of these, or causes I
didn't think about...
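One way to see how often a cluster gets close to that 5s limit is to time
4MB writes from a gateway and flag the slow ones. The following is only a
minimal sketch under assumptions: probe_io_latency, its parameters and the
write_io callable are hypothetical names introduced here, not part of Ceph,
LIO or ESXi, and the caller has to supply whatever synchronous write makes
sense for the setup.

```python
import time
from typing import Callable, List, Tuple

ABORT_BUDGET_S = 5.0        # the 5s acknowledgement limit described above
IO_SIZE = 4 * 1024 * 1024   # 4MB, the IO size mentioned above


def probe_io_latency(
    write_io: Callable[[bytes], None],
    count: int,
    budget_s: float = ABORT_BUDGET_S,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[float], List[int]]:
    """Time `count` writes of IO_SIZE bytes.

    Returns (latencies in seconds, indexes of writes that were not
    completed in less than `budget_s`).
    """
    payload = b"\0" * IO_SIZE
    latencies: List[float] = []
    breaches: List[int] = []
    for i in range(count):
        start = clock()
        write_io(payload)            # hypothetical: caller-supplied sync write
        elapsed = clock() - start
        latencies.append(elapsed)
        if elapsed >= budget_s:      # "less than 5s" was not met
            breaches.append(i)
    return latencies, breaches
```

Running such a probe while deep-scrubbing or during an OSD failure test would
show whether any single IO crosses the budget, which is the condition
described above as triggering the 'Abort Task' loop.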
What I have tested:
iSCSI Active/Active HA cluster. Each ESXi sees the same datastore
through both targets but only accesses it through one path at a time,
a statically defined preferred path.
3 ESXi work on one target, 3 ESXi work on the other. If a target
goes down, the other paths are used.
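The path layout described above (6 ESXi hosts split 3/3 across two targets,
with failover to the surviving target) can be modelled in a few lines. This
is a sketch of the selection logic only, not of how ESXi multipathing is
configured; the host and target names and both functions are hypothetical.

```python
from typing import Dict, List, Optional, Set

TARGETS = ["target-a", "target-b"]           # hypothetical target names
HOSTS = [f"esxi-{n}" for n in range(1, 7)]   # hypothetical host names


def preferred_paths(hosts: List[str], targets: List[str]) -> Dict[str, str]:
    """Statically split hosts across targets in equal consecutive groups."""
    per_target = -(-len(hosts) // len(targets))  # ceiling division
    return {h: targets[i // per_target] for i, h in enumerate(hosts)}


def active_path(
    host: str,
    preferred: Dict[str, str],
    targets: List[str],
    down: Set[str],
) -> Optional[str]:
    """Use the preferred target if it is up, else the first surviving one."""
    if preferred[host] not in down:
        return preferred[host]
    for target in targets:
        if target not in down:
            return target
    return None  # no path left: the datastore is unreachable
```

With both targets up, each target serves three hosts; with one target down,
all six hosts end up on the other one.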
- LIO iSCSI targets with kernel RBD mapping (no cache). VAAI
methods. Easy to configure. Delivers good performance with eager-zeroed
virtual disks. The 'Abort Task' loop causes the ESXi hosts to disconnect
from the vCenter Server.
Restarting the target gets them back in, but some VMs have certainly
crashed by then.
- FreeBSD / FreeNAS running in KVM (on top of CentOS) mapping RBD
images through librbd. Found that the fileio backstore was used. Found
it hard to make it HA with librbd cache. And still the 'Abort Task'
loop...
- SCST ESOS targets with kernel RBD mapping (no cache). VAAI
methods, ALUA. Easy to configure too. 'Abort Task' still happens but the
ESXi hosts do not get disconnected from the vCenter Server. Still, the
targets have to be restarted to fix the situation.
* NFS ?
Gives lower performance than iSCSI, we know that too. BUT, it's
probably the best option right now. It's very easy to make it HA with
Pacemaker/Corosync as VMware doesn't make use of the NFS lock manager.
Here is a good starting point:
https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
We're still benchmarking IOPS to decide whether we can go into
production with this infrastructure but we're actually very satisfied
with the HA mechanism.
Running synchronous writes on multiple VMs (on virtual disks hosted
on NFS datastores with 'sync' exports of RBD images) while Storage
vMotioning those multiple disks between NFS RBD datastores and flapping
the VIP (and thus the NFS exports) from one server to the other at the
same time never kills any VM nor makes any datastore unavailable.
And every Storage vMotion task completes! These are excellent
results. Note that it's important to run VMware Tools in VMs as the
VMware Tools installation extends the write delay timeout on local iSCSI
devices.
What I have tested:
- NFS exports with async mode sharing RBD images with XFS on top of
it. Gives the best performance but, obviously, no one will want to
use this mode in production.
- NFS exports with sync mode sharing RBD images with XFS on top of
it. Gives mixed performance. We would clearly advertise this type of
storage as capacity-oriented, not performance-oriented, in our IaaS
service.
As VMs cache writes, IOPS might be good enough for tier 2 or 3
applications. We would probably be able to increase the number of IOPS
by using more RBD images and NFS shares.
- NFS exports with sync mode sharing RBD images with ZFS (with
compression) on top of it. The idea is to provide better performance by
putting the SLOG (write journal) on fast SSD drives.
See this real life (love-)story :
https://virtualexistenz.wordpress.com/2013/02/01/using-zfs-storage-as-vmware-nfs-datastores-a-real-life-love-story/
Each NFS server has 2 mirrored SSDs (RAID1). Each NFS server
exports partitions of this SSD volume through iSCSI.
Each NFS server is a client of both the local and the remote iSCSI
target. The SLOG device is then a ZFS mirror of 2 disks: the local iSCSI
device and the remote iSCSI device (as vdevs).
So even if a whole NFS server crashes or is permanently down, the
ZFS pool can still be imported on the second NFS server.
First benchmarks show a x4 performance improvement. Further tests
will help to decide whether it's safe or not to go into production with
this level of complexity.
Still, as we're using VMware clustered datastores, it's easy to
go back to classic XFS NFS datastores by putting a ZFS datastore in
maintenance mode.
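The failover argument for the mirrored SLOG described above can be checked
with a small model. It is a simplified sketch: server names, SlogMember and
both functions are hypothetical, and the RBD-backed data vdevs are assumed
to stay reachable from either NFS server through Ceph, so only the SLOG
mirror is modelled.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class SlogMember:
    backing_server: str  # NFS server whose SSD volume backs this iSCSI device


# Hypothetical names: one SLOG mirror member per NFS server.
SLOG_MIRROR: Tuple[SlogMember, ...] = (SlogMember("nfs-1"), SlogMember("nfs-2"))


def surviving_members(
    mirror: Tuple[SlogMember, ...], failed_servers: Set[str]
) -> List[SlogMember]:
    """Mirror members whose backing server is still up."""
    return [m for m in mirror if m.backing_server not in failed_servers]


def can_import_on(
    server: str, mirror: Tuple[SlogMember, ...], failed_servers: Set[str]
) -> bool:
    """The pool can move to `server` if that server is up and at least one
    SLOG mirror member is still reachable (assumption: data vdevs are fine)."""
    return server not in failed_servers and bool(
        surviving_members(mirror, failed_servers)
    )
```

Losing either server leaves one SLOG member on the survivor, which is why
the pool can still be imported there; losing both leaves nothing to import.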
As for SUSE Enterprise Storage HA iSCSI targets, I doubt they can do
any better regarding the 'Abort Task' command, unless they patch the
Ceph cluster to be able to abort an IO, which I doubt they could.
From what I understand, with how the ESXi iSCSI software initiator
works, the Ceph cluster HAS to ACK an IO in less than 5s. Period.
Regards,
Frederic Nass.
PS: Thank you Nick for your help regarding the 'Abort Task' loop. ;-)
On 05/10/2016 at 20:32, Patrick McGarry wrote:
Hey guys,
Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.
If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:
1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc
Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.