Re: VMware + Ceph using NFS sync/async ?

Hi Nick,

Interesting note on PG locking, but I would be surprised if its effect is that bad. I would think that in your example the 2 ms is the total latency and the lock is probably only held for a small portion of it, so the concurrent operations are not serialized for the entire time... but again, I may be wrong. Also, if the lock were that bad, we should see 4k sequential writes being much slower than random ones in general testing, which is not the case.

Another thing that may help with VM migration, as per your description, is reducing the RBD stripe size to something a couple of times smaller than the 2M window ESXi keeps in flight (32 x 64k), so the parallel writes are spread over more objects/PGs.
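For example, something like this (an untested sketch using the python-rbd bindings; pool name, image name and size are just placeholders, and custom striping needs the STRIPINGV2 feature):

# Untested sketch: create an RBD image whose stripe unit matches the 64k
# IOs ESXi issues, so the 32 in-flight writes land on different objects/PGs.
# Pool name, image name and size are placeholders.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')                # target pool
    try:
        rbd.RBD().create(
            ioctx,
            'esxi-datastore-img',
            100 * 1024 ** 3,                         # 100 GiB image
            old_format=False,
            features=rbd.RBD_FEATURE_STRIPINGV2,     # required for custom striping
            stripe_unit=64 * 1024,                   # 64k stripe unit
            stripe_count=32,                         # spread across 32 objects
        )
    finally:
        ioctx.close()
finally:
    cluster.shutdown()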

Maged

 

On 2017-08-16 16:12, Nick Fisk wrote:

Hi Matt,

 

Well-behaved applications are the problem here. ESXi sends all writes as sync writes. So although the guest OS's will still do their own buffering, any ESXi-level operation is done as sync. This is probably most visible when migrating VMs between datastores: everything gets done as sync 64KB IOs, meaning copying a 1TB VM can often take nearly 24 hours.

 

Osama, can you describe the difference in performance you see between OpenStack and ESXi, and what type of operations these are? Sync writes should be the same no matter the client, except that in the NFS case you will have an extra network hop and potentially a little bit of PG congestion around the FS journal on the RBD device.

 

Osama, you can't compare Ceph to a SAN. Just in terms of network latency you have an extra 2 hops. In an ideal scenario you might be able to get Ceph write latency down to 0.5-1ms for a 4KB IO, compared to about 0.1-0.3ms for a storage array. However, what you will find with Ceph is that other things start to increase this average long before you would see that on a storage array.

 

The migration is a good example of this. As I said, ESXi migrates a VM in 64KB IOs, but does 32 of these blocks in parallel at a time. On storage arrays, these 64KB IOs are coalesced in the battery-protected write cache into bigger IOs before being persisted to disk. The storage array can also accept all 32 of these requests at once.

 

A similar thing happens in Ceph/RBD/NFS via the Ceph filestore journal, but that coalescing is now an extra 2 hops away, and with the extra latency introduced by the Ceph code we are already a bit slower. But here's the killer: PG locking!!! You can't write 32 IOs in parallel to the same object/PG; each one has to be processed sequentially because of the locks. (Please, someone correct me if I'm wrong here.) If your 64KB write latency is 2ms, then you can only do 500 64KB IOs a second. 64KB * 500 = ~30MB/s, vs a storage array which would be doing the operation in the hundreds of MB/s range.
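To put rough numbers on that (the 2ms figure is just the assumed per-write latency from above):

# Back-of-the-envelope: serialized 64KB writes into one object/PG versus a
# write cache that accepts all 32 in parallel. The 2ms latency is assumed.
io_size = 64 * 1024        # bytes per ESXi migration write
latency = 0.002            # assumed per-write latency in seconds
parallelism = 32           # in-flight writes ESXi issues

serialized_iops = 1 / latency                 # PG lock: one write at a time
serialized_bw = serialized_iops * io_size     # ~31 MiB/s
parallel_iops = parallelism / latency         # ideal case, no serialization
parallel_bw = parallel_iops * io_size         # ~1000 MiB/s

print(f"serialized: {serialized_bw / 2**20:.0f} MiB/s ({serialized_iops:.0f} IOPS)")
print(f"parallel:   {parallel_bw / 2**20:.0f} MiB/s ({parallel_iops:.0f} IOPS)")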

 

Note: When proper iSCSI support for RBD is finished, you might be able to use the VAAI offloads, which would dramatically increase performance for migrations as well.

 

Also, once persistent SSD write caching for librbd becomes available, a lot of these problems will go away, as the SSD will behave like a storage array's write cache and will only be 1 hop away from the client as well.

 

From: Matt Benjamin [mailto:mbenjami@xxxxxxxxxx]
Sent: 16 August 2017 14:49
To: Osama Hasebou <osama.hasebou@xxxxxx>
Cc: nick@xxxxxxxxxx; ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: VMware + Ceph using NFS sync/async ?

 

Hi Osama,

I don't have a clear sense of the application workflow here--and Nick appears to--but I thought it worth noting that NFSv3 and NFSv4 clients shouldn't normally need the sync mount option to achieve I/O stability with well-behaved applications.  In both versions of the protocol, an application write that is synchronous (or, more typically, the equivalent application sync barrier) should not succeed until an NFS-protocol COMMIT (or in some cases with NFSv4, a WRITE with the stable flag set) has been acknowledged by the NFS server.  If the NFS I/O stability model is insufficient for your workflow, moreover, I'd be worried that -o sync writes (which might be incompletely applied during a failure event) may not be correctly enforcing your invariant either.
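For illustration, the application side of that barrier is just ordinary POSIX; on an NFS mount the fsync() is what drives the COMMIT (the path below is only a placeholder):

# Minimal illustration of an application-level sync barrier on an NFS mount.
# os.fsync() returns only once the NFS client has had the data committed
# to stable storage by the server (COMMIT, or a stable WRITE).
import os

path = "/mnt/nfs-datastore/testfile"    # placeholder path on the NFS mount

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"x" * 65536)   # may still be unstable on the server
    os.fsync(fd)                 # the sync barrier: blocks until committed
finally:
    os.close(fd)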

 

Matt

 

On Wed, Aug 16, 2017 at 8:33 AM, Osama Hasebou <osama.hasebou@xxxxxx> wrote:

Hi Nick,

 

Thanks for replying! If Ceph is combined with OpenStack, does that mean that when OpenStack writes are happening, the data is not fully synced (as in written to disk) before more data is accepted, i.e. it acts as async? In that scenario, is there a chance of data loss if things go bad, e.g. a power outage or something like that?

 

As for the slow operations, reading is quite fine when I compare it to a SAN storage system connected to VMware. It is writing data, small chunks or big ones, that suffers when trying to use the sync option with fio for benchmarking.
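For reference, a sync-write fio job along these lines (driven from Python here; the target path, size and runtime are made-up values, and --fsync=1 / --sync=1 are what make each write wait for stable storage):

# Illustrative only: run a small sync-write fio job from Python.
# Path, size and runtime are placeholders.
import subprocess

subprocess.run([
    "fio",
    "--name=sync-write-test",
    "--filename=/mnt/nfs-datastore/fio.test",  # placeholder path
    "--rw=randwrite",
    "--bs=4k",
    "--size=1G",
    "--runtime=60",
    "--time_based",
    "--direct=1",      # bypass the page cache
    "--fsync=1",       # fsync after every write, so each IO waits for stable storage
    "--iodepth=1",
    "--numjobs=1",
], check=True)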

 

In that case, I wonder, is no one using Ceph with VMware in a production environment?

 

Cheers.

 

Regards,
Ossi

 

 

 

Hi Osama,

 

This is a known problem with many software-defined storage stacks, but potentially slightly worse with Ceph due to extra overheads. Sync writes have to wait until all copies of the data are written to disk by the OSDs and acknowledged back to the client. The extra network hops for replication and the NFS gateway add significant latency, which impacts the time it takes to carry out small writes. The Ceph code also takes time to process each IO request.
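As a rough illustration of where the time goes for a single small sync write (every number below is made up, purely to show how the hops add up):

# Illustrative latency budget for one sync 4KB write through an NFS gateway
# backed by RBD with 3x replication. Every figure here is an assumption.
hop = 0.05e-3            # assumed one-way network hop, ~50us on 10G
osd_processing = 0.3e-3  # assumed Ceph software overhead per write

client_to_nfs = 2 * hop        # ESXi client -> NFS gateway, and the reply
nfs_to_primary = 2 * hop       # gateway (librbd) -> primary OSD, and the ack
primary_to_replicas = 2 * hop  # replicas are written in parallel: one round trip

total = client_to_nfs + nfs_to_primary + primary_to_replicas + osd_processing
print(f"~{total * 1000:.2f} ms per sync write (illustrative only)")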

 

What particular operations are you finding slow? Storage vMotions are just bad, and I don't think there is much that can be done about them, as they are split into lots of 64KB IOs.

 

One thing you can try is to force the CPUs on your OSD nodes to run at the C1 C-state and force their minimum frequency to 100%. This can have quite a large impact on latency. Also, you don't specify your network, but 10G is a must.
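Tuned or cpupower are the usual tools for that; for reference, the underlying mechanism is roughly this (a sketch only, needs root, and the C1 exit-latency value and sysfs paths are assumptions that depend on the CPU and cpufreq driver):

# Sketch only: cap C-states via /dev/cpu_dma_latency and raise the minimum
# cpufreq to the maximum. Needs root; normally tuned/cpupower does this.
import glob
import os
import struct

# Request a latency cap just above C1's exit latency (the value 2us is an
# assumption; check /sys/devices/system/cpu/cpu0/cpuidle/state*/latency).
# The cap only holds while this file descriptor stays open.
latency_fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
os.write(latency_fd, struct.pack("i", 2))

# Pin scaling_min_freq to the hardware maximum on every CPU.
for cpu in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq"):
    with open(os.path.join(cpu, "cpuinfo_max_freq")) as f:
        max_freq = f.read().strip()
    with open(os.path.join(cpu, "scaling_min_freq"), "w") as f:
        f.write(max_freq)

# ... keep the process (and latency_fd) alive for as long as the tuning applies.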

 

Nick

 

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Osama Hasebou
Sent: 14 August 2017 12:27
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: VMware + Ceph using NFS sync/async ?

 

Hi Everyone,

 

We started testing the idea of using Ceph storage with VMware. The idea was to provide Ceph storage to VMware through OpenStack: a virtual machine created from Ceph + OpenStack acts as an NFS gateway, and that storage is then mounted on top of the VMware cluster.

 

When mounting the NFS exports using the sync option, we noticed a huge degradation in performance which makes it very slow to use in production. The async option makes it much better, but then there is the risk that if a failure happens, some data might be lost in that scenario.

 

Now, I understand that some people in the Ceph community are using Ceph with VMware via NFS gateways, so if you could kindly shed some light on your experience, and whether you use it for production purposes, that would be great. How did you handle the sync/async options and keep write performance?

 

 

Thank you!!!

 

Regards,
Ossi





 




 

 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
