Re: CephFS+NFS For VMWare

David C <dcsysengineer@xxxxxxxxx> · Mon, 2 Jul 2018 19:41:36 +0100

On Sat, 30 Jun 2018, 21:48 Nick Fisk, <nick@xxxxxxxxxx> wrote:
Hi Paul,

Thanks for your response. Is there anything you can go into more detail on and share with the list? I’m sure it would be much appreciated by more than just myself.

I was planning on kernel CephFS and the kernel NFS server; both seem to achieve better performance, although stability is of greater concern.
FWIW, a recent nfs-ganesha could be more stable than kernel NFS. I've had a fair few issues with knfs exporting CephFS: it works fine until there is an issue with your cluster, such as an MDS going down or slow requests, and then you can end up with your nfsd processes in the dreaded uninterruptible sleep.
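
To make the "uninterruptible sleep" symptom easier to spot, here is a minimal Python sketch that scans /proc for processes in state D. It assumes a Linux host and that the kernel NFS threads show up with the command name "nfsd"; the function names parse_stat and find_blocked are made up for illustration. It only reports the state and does not attempt any recovery.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' to find where the state field begins.
    left, _, right = stat_line.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state

def find_blocked(name="nfsd", proc_root="/proc"):
    """List PIDs whose command name matches and whose state is D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if comm == name and state == "D":
            blocked.append(pid)
    return blocked

if __name__ == "__main__":
    print("nfsd threads in uninterruptible sleep:", find_blocked())
```

A thread briefly in state D during normal IO is expected; the case described above is one where the same PIDs stay in D across repeated runs.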

Also consider CTDB for basic active/active NFS on CephFS. It works fine for normal Linux clients; I'm not sure how well it would work with ESX. If you want to use CTDB with Ganesha, I think you're restricted to the plain VFS FSAL; I don't think the Ceph FSAL will give you the consistent file handles you need for client failover to work properly (although I could be wrong there).

Thanks,
Nick
From: Paul Emmerich [mailto:paul.emmerich@xxxxxxxx] 
Sent: 29 June 2018 17:57
To: Nick Fisk <nick@xxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  CephFS+NFS For VMWare

VMWare can be quite picky about NFS servers.
Some things that you should test before deploying anything with that in production:

* failover
* reconnects after NFS reboots or outages
* NFS3 vs NFS4
* Kernel NFS (which kernel version? cephfs-fuse or cephfs-kernel?) vs NFS Ganesha (VFS FSAL vs. Ceph FSAL)
* Stress tests with lots of VMWare clients - we had a setup that ran fine with 5 big VMWare hypervisors but started to get random deadlocks once we added 5 more
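
For the failover and reconnect items in the list above, a rough way to time an outage from a test client is to poll the NFS server's TCP port while you reboot or fail it over. The Python sketch below assumes the server listens on the standard NFS port 2049; the function names are hypothetical. A successful TCP connect only shows that the port is reachable, not that ESXi has recovered its datastore, so treat it as a timing aid alongside the ESXi logs rather than a pass/fail test.

```python
import socket
import time

def port_open(host, port=2049, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def summarize_outages(samples):
    """Turn [(timestamp, is_up), ...] into a list of (down_at, up_at) pairs.

    up_at is None if the server was still down at the last sample.
    """
    outages = []
    down_since = None
    for ts, up in samples:
        if not up and down_since is None:
            down_since = ts                    # outage starts
        elif up and down_since is not None:
            outages.append((down_since, ts))   # outage ends
            down_since = None
    if down_since is not None:
        outages.append((down_since, None))
    return outages

def watch(host, duration_s=60.0, interval_s=1.0):
    """Poll the server for duration_s seconds and return the outages seen."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), port_open(host)))
        time.sleep(interval_s)
    return summarize_outages(samples)
```

Calling watch() with the server's address while triggering the failover gives the start and end of each window in which the port was unreachable, on the monotonic clock of the test client.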

We are running CephFS + NFS + VMWare in production, but we encountered *a lot* of problems before we got it stable for a few configurations.
Be prepared to debug NFS problems at a low level with tcpdump and a careful read of the RFC and NFS server source ;)

Paul

2018-06-29 18:48 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
This is for us peeps using Ceph with VMWare.

My current favoured solution for consuming Ceph in VMWare is via RBDs formatted with XFS and exported via NFS to ESXi. This seems to perform better than iSCSI+VMFS, which seems not to play nicely with Ceph’s PG contention issues, particularly when working with thin-provisioned VMDKs.

I’ve still been noticing some performance issues, however, mainly when doing any form of storage migration. This is largely due to the way vSphere transfers VMs in 64KB IOs at a queue depth (QD) of 32. vSphere does this so that arrays with QoS can balance the IO more easily than if larger IOs were submitted. However, Ceph’s PG locking means that only one or two of these IOs can happen at a time, seriously lowering throughput. Typically you won’t be able to push more than 20-25MB/s during a storage migration.
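
As a back-of-the-envelope illustration of why serialising 64KB IOs hurts, the sketch below applies Little's law (throughput = IOs in flight x IO size / latency). The 3 ms per-IO latency is an assumed value picked for illustration, not a measurement from this cluster, and the function name is made up. In practice latency would also rise as more IOs are in flight, so the QD32 figure is an upper bound for this toy model only.

```python
def throughput_mib_s(in_flight, io_kib, latency_ms):
    """Steady-state throughput if `in_flight` IOs each complete every latency_ms."""
    iops = in_flight * 1000.0 / latency_ms
    return iops * io_kib / 1024.0

ASSUMED_LATENCY_MS = 3.0  # hypothetical per-IO latency, NOT a measured value

if __name__ == "__main__":
    for in_flight in (1, 2, 32):
        mib_s = throughput_mib_s(in_flight, 64, ASSUMED_LATENCY_MS)
        print(f"{in_flight:>2} IOs in flight -> {mib_s:.1f} MiB/s")
```

With a single IO effectively in flight, this model gives roughly 21 MiB/s at the assumed latency, which is in the same range as the 20-25MB/s described above.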

There is another issue: the IO needed for the XFS journal on the RBD can cause contention, and it effectively means that every NFS write IO sends two IOs down to Ceph. This can have an impact on latency as well. Because of the possible PG contention caused by XFS journal updates when multiple IOs are in flight, you normally end up making more and more RBDs to try to spread the load. That in turn means you end up having to do storage migrations... you can see where I’m going with this.

I’ve been thinking for a while that CephFS works around a lot of these limitations. 

1.       It supports fancy striping, which should mean there is less per-object contention
2.       There is no FS in the middle to maintain a journal and other associated IO
3.       A single large NFS mount should have none of the disadvantages seen with a single RBD
4.       No need to migrate VM’s about because of #3
5.       No need to fstrim after deleting VM’s
6.       Potential to do away with pacemaker and use LVS to do active/active NFS as ESXi does its own locking with files
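
To illustrate point 1, here is a small Python sketch of how a file offset maps to an object under a striped layout, following Ceph's file striping scheme (stripe_unit, stripe_count, object_size) as I understand it. The function name locate and the example layout values are hypothetical and were not tested in this thread; the point is only to show how a burst of 32 sequential 64KB IOs is spread over more objects when stripe_count is greater than 1.

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a file byte offset to (object_number, offset_within_object)."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # which stripe unit of the file
    stripe_no = block_no // stripe_count      # which row across the object set
    stripe_pos = block_no % stripe_count      # which object within the set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    object_off = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, object_off

KIB = 1024
MIB = 1024 * KIB

def objects_touched(n_ios, io_size, stripe_unit, stripe_count, object_size):
    """Count distinct objects hit by n_ios sequential IOs starting at offset 0."""
    return len({locate(i * io_size, stripe_unit, stripe_count, object_size)[0]
                for i in range(n_ios)})

if __name__ == "__main__":
    # Unstriped layout: stripe_unit == object_size, stripe_count == 1.
    print(objects_touched(32, 64 * KIB, 4 * MIB, 1, 4 * MIB))
    # Hypothetical striped layout: 64 KiB units across 4 objects.
    print(objects_touched(32, 64 * KIB, 64 * KIB, 4, 4 * MIB))
```

In the unstriped case all 32 IOs land in the same object, while the example striped layout spreads them over 4 objects, and hence potentially over several PGs.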

With this in mind I exported a CephFS mount via NFS and then mounted it to an ESXi host as a test.

Initial results are looking very good. I’m seeing storage migrations to the NFS mount going at over 200MB/s, which equates to several thousand IOs per second and suggests it is writing at the intended QD32.
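
As a quick sanity check on those numbers, the sketch below converts throughput into IOs per second at a 64KB IO size and, assuming the queue really is kept 32 deep, derives the average per-IO latency that would imply (Little's law again). It treats MB and KB as binary units, which is an assumption on my part, and the function names are made up.

```python
def iops(throughput_mib_s, io_kib):
    """IOs per second needed to sustain a throughput at a fixed IO size."""
    return throughput_mib_s * 1024.0 / io_kib

def implied_latency_ms(queue_depth, ios_per_s):
    """Average per-IO latency if queue_depth IOs are always in flight."""
    return queue_depth / ios_per_s * 1000.0

if __name__ == "__main__":
    for mib_s in (20, 25, 200):
        print(f"{mib_s} MiB/s at 64 KiB -> {iops(mib_s, 64):.0f} IO/s")
    print(f"QD32 at 200 MiB/s -> {implied_latency_ms(32, iops(200, 64)):.1f} ms per IO")
```

At 64 KiB per IO, 200 MiB/s works out to 3200 IO/s, against 320 to 400 IO/s for the 20-25MB/s seen with RBD+XFS.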

I need to do more testing to make sure everything works as intended, but like I say, promising initial results. 

Further testing needs to be done to see what sort of MDS performance is required; I would imagine that, since we are mainly dealing with large files, it might not be that critical. I also need to consider the stability of CephFS: RBD is relatively simple and is in use by a large proportion of the Ceph community, whereas CephFS is a lot easier to “upset”.

Nick

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90