Hi Nick,
On 5/1/2018 11:50 PM, Nick Fisk wrote:
Hi all,
I'm slowly getting round to migrating clusters to Bluestore, but I am interested in how people are handling the potential change in write latency coming from Filestore. Or maybe nobody is really seeing much difference?
As we all know, in Bluestore writes are not double-written and in most cases go straight to disk. Whilst this is awesome for people with pure-SSD or pure-HDD clusters, as the amount of overhead is drastically reduced, for people with HDD+SSD journals in Filestore land the double write had the side effect of acting like a battery-backed cache, accelerating writes when not under saturation.
In some brief testing I am seeing Filestore OSDs with NVMe journals show an average apply latency of around 1-2ms, whereas some new Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly certain this is due to writes exhibiting the latency of the underlying 7.2k disk. Note: the cluster is very lightly loaded; nothing is being driven into saturation.
I know there is a deferred-write tuning knob which adjusts the cutover for when an object is double-written, but at the default of 32KB I suspect a lot of IOs, even in the 1MB range, are still drastically slower going straight to disk than if double-written to NVMe first. Has anybody else done any investigation in this area? Is there any long-term harm in running a cluster that defers writes up to 1MB+ in size to mimic the Filestore double-write approach?
This should work fine under low load, but be careful when the load rises. RocksDB and the machinery around it might become a bottleneck in this scenario.
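For reference, that cutover is bluestore_prefer_deferred_size (there are also per-device-class variants). A minimal ceph.conf sketch of the 1MB experiment, assuming the Luminous-era option name for rotational OSDs, would be something like:

    [osd]
    # Assumption: defer (double-write) everything up to 1MB on rotational OSDs,
    # mimicking the Filestore journal behaviour discussed above
    bluestore_prefer_deferred_size_hdd = 1048576

Verify the option name and default against your Ceph version before relying on it.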
Yep, this cluster has extremely low load, but the client is submitting largely sequential 1MB sync writes (NFS). The cluster needs to ack them as fast as possible.
I also suspect, after looking through GitHub, that deferred writes only happen when overwriting an existing object or blob (not sure which case applies), so new allocations are still written straight to disk. Can anyone confirm?
"small" writes (length < min_alloc_size) are direct if they go to unused chunk (4K or more depending on checksum settings) of an existing mutable block and write length > bluestore_prefer_deferred_size only.
E.g. appending 4K data blocks to an object on an HDD will trigger deferred mode for the first of every 16 writes (given that the default min_alloc_size for HDD is 64K); the remaining 15 go direct.
"big" writes are unconditionally deferred if length <= bluestore_prefer_deferred_size.
So, according to the defaults and assuming an RBD comprised of 4MB objects (see the quick check after the lists below):
1. Writes between 32K and 64K will go direct if written to an unused chunk
2. Writes below 32K, to both existing and new chunks, will be deferred
3. Everything above 32K will be direct
If I were to increase the deferred-write cutover to 1MB:
1. Everything between 64K and 1MB is deferred
2. Everything above 1MB is direct
3. Everything below 64K is still deferred, for both new and existing chunks, because bluestore_prefer_deferred_size > min_alloc_size
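A quick check of those cases against the sketch above (same caveat: this follows the rules as described in this thread, not the actual code):

    # Defaults (32K cutover, 64K min_alloc_size):
    write_is_deferred(48 * 1024, True)    # False -> direct   (case 1)
    write_is_deferred(16 * 1024, True)    # True  -> deferred (case 2)
    write_is_deferred(128 * 1024, False)  # False -> direct   (case 3)

    # Cutover raised to 1MB:
    write_is_deferred(512 * 1024, False, prefer_deferred=1024 * 1024)   # True  -> deferred (case 1)
    write_is_deferred(2048 * 1024, False, prefer_deferred=1024 * 1024)  # False -> direct   (case 2)
    write_is_deferred(16 * 1024, True, prefer_deferred=1024 * 1024)     # True  -> deferred (case 3)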
PS. If your spinning disks are connected via a RAID controller with BBWC then you are not affected by this.
Thanks,
Nick