Re: Bluestore on HDD+SSD sync write latency experiences

Hi Dan,

Quoting Dan van der Ster <dan@xxxxxxxxxxxxxx>:

Hi Nick,

Our latency probe results (4kB rados bench) didn't change noticeably
after converting a test cluster from FileStore (SATA SSD journal) to
BlueStore (SATA SSD DB). Those 4kB writes take 3-4ms on average from a
random VM in our data centre. (So the BlueStore DB seems equivalent to
the FileStore journal for small writes.)
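
(For reference, a comparable probe can be run by hand with something
like the following -- the pool name, runtime and single-op queue depth
here are illustrative choices, not necessarily what the probe uses:

  # 4kB writes, one op in flight, against a scratch pool
  rados bench -p scratch 60 write -b 4096 -t 1

The average latency reported at the end of the run is the figure being
compared above.)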

Otherwise, our other monitoring (OSD log analysis) shows that the vast
majority of writes are under 32kB, and the average write size is 42kB
(with a long tail out to 4MB).

So... do you think this is a *real* issue that would impact
user-observed latency, given that they are mostly doing small writes?
(Maybe your environment is very different?)
I'm not saying that tuning the deferred write threshold up wouldn't
help, but it's not obvious that deferring writes is better on the
whole.

Probably for a lot of users running VMs, like you say, most writes will be under 32kB and they won't notice much difference. My workload is one where the client submits largely sequential 1MB writes synchronously (NFS), but at a fairly low queue depth, and the cluster needs to ack them as fast as possible. In this case, writing the IOs through the NVMe first seems to help by quite a large margin.
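
Roughly, that client-side pattern can be approximated with an fio job
along these lines; the path, file size and queue depth of 1 are
illustrative choices rather than an exact description of my workload:

  fio --name=seq-sync-write --directory=/mnt/nfs/test \
      --rw=write --bs=1M --size=4G \
      --ioengine=sync --fdatasync=1 --iodepth=1 --numjobs=1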


I'm curious: what was the original rationale for 32kB?

Cheers, Dan


On Tue, May 1, 2018 at 10:50 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
Hi all,



I'm slowly getting round to migrating clusters to BlueStore, but I am
interested in how people are handling the potential change in write
latency coming from FileStore. Or maybe nobody is really seeing much
difference?



As we all know, in BlueStore writes are not double-written and in most
cases go straight to disk. Whilst this is awesome for people with pure
SSD or pure HDD clusters, as the amount of overhead is drastically
reduced, for people with HDD+SSD journals in FileStore land the double
write had the side effect of acting like a battery-backed cache,
accelerating writes when not under saturation.



In some brief testing I am seeing FileStore OSDs with an NVMe journal
show an average apply latency of around 1-2ms, whereas some new
BlueStore OSDs in the same cluster are showing 20-40ms. I am fairly
certain this is due to writes exhibiting the latency of the underlying
7.2k disk. Note: the cluster is very lightly loaded; nothing is being
driven into saturation.
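
(For anyone wanting to check the same counters on their own cluster,
the per-OSD commit and apply latencies can be listed with:

  ceph osd perf

which prints both figures in milliseconds, one row per OSD.)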



I know there is a deferred write tuning knob which adjusts the cutover
for when an object is double-written, but at the default of 32kB I
suspect a lot of IOs, even in the 1MB area, are still drastically
slower going straight to disk than if double-written to NVMe first. Has
anybody else done any investigation in this area? Is there any
long-term harm in running a cluster deferring writes up to 1MB+ in size
to mimic the FileStore double-write approach?
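
For reference, the knob in question appears to be
bluestore_prefer_deferred_size (with a per-media variant,
bluestore_prefer_deferred_size_hdd, defaulting to 32kB). A sketch of
bumping it to 1MB, assuming it can be injected at runtime (otherwise
set it in ceph.conf and restart the OSDs):

  # runtime, all OSDs
  ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd=1048576'

  # persistent, in ceph.conf
  [osd]
  bluestore prefer deferred size hdd = 1048576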



I also suspect, after looking through GitHub, that deferred writes only
happen when overwriting an existing object or blob (not sure which case
applies), so new allocations are still written straight to disk. Can
anyone confirm?



P.S. If your spinning disks are connected via a RAID controller with
BBWC, then you are not affected by this.



Thanks,

Nick






_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



