Re: Bluestore on HDD+SSD sync write latency experiences

On Thu, May 3, 2018 at 6:54 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> -----Original Message-----
> From: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
> Sent: 02 May 2018 22:05
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Bluestore on HDD+SSD sync write latency experiences
>
> Hi Nick,
>
> On Tue, May 1, 2018 at 4:50 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> Hi all,
>>
>>
>>
>> Slowly getting round to migrating clusters to Bluestore, but I am
>> interested in how people are handling the potential change in write
>> latency coming from Filestore. Or maybe nobody is really seeing much difference?
>>
>>
>>
>> As we all know, in Bluestore writes are not double-written and in
>> most cases go straight to disk. Whilst this is awesome for people with
>> pure SSD or pure HDD clusters, as the amount of overhead is drastically
>> reduced, for people with HDD+SSD journals in Filestore land the
>> double write had the side effect of acting like a battery-backed
>> cache, accelerating writes when not under saturation.
>>
>>
>>
>> In some brief testing I am seeing Filestore OSDs with an NVMe journal
>> show an average apply latency of around 1-2ms, whereas some new
>> Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly
>> certain this is due to writes exhibiting the latency of the underlying
>> 7.2k disk. Note: the cluster is very lightly loaded; this is not anything being driven into saturation.
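>>
>> For reference, the per-OSD apply/commit latency figures can be pulled
>> from the standard counters, e.g. (osd.<id> is a placeholder, and the
>> perf dump has to be run on the node hosting that OSD):
>>
>>     ceph osd perf                    # commit_latency / apply_latency per OSD
>>     ceph daemon osd.<id> perf dump   # detailed per-OSD latency counters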
>>
>>
>>
>> I know there is a deferred write tuning knob which adjusts the cutover
>> for when an object is double written, but at the default of 32KB, I
>> suspect a lot of IOs even in the 1MB range are still drastically
>> slower going straight to disk than if double-written to NVMe first. Has
>> anybody else done any investigation in this area? Is there any long-term
>> harm in running a cluster that defers writes up to 1MB+ in size to
>> mimic the Filestore double-write approach?
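>>
>> The knob I am referring to appears to be bluestore_prefer_deferred_size_hdd
>> (default 32768); writes smaller than that are written to the WAL first and
>> deferred to the data disk. If anyone wants to experiment, bumping it to 1MB
>> would look something like the below (option name as of Luminous, the value
>> is just an example, and a change may need an OSD restart to take effect):
>>
>>     # ceph.conf, [osd] section
>>     bluestore_prefer_deferred_size_hdd = 1048576
>>
>>     # or at runtime
>>     ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd 1048576'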
>>
>>
>>
>> I also suspect, after looking through GitHub, that deferred writes only
>> happen when overwriting an existing object or blob (not sure which
>> case applies), so new allocations are still written straight to disk. Can anyone confirm?
>>
>>
>>
>> PS. If your spinning disks are connected via a RAID controller with
>> BBWC, then you are not affected by this.
>
> We saw this behavior even on an Areca 1883, which does buffer HDD writes.
> The way out was to put the WAL and DB on NVMe drives, and that solved the performance problems.
>
> Just to confirm: our problem is not poor performance of RocksDB when running on HDD, but the direct write of data to disk. Or have I misunderstood your comment?

Correct, the write latencies were quite high; once we moved the WAL and DB
to NVMe PCIe devices, the latencies greatly improved.  Almost like
Filestore journal behavior.
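
For anyone wanting to do the same, with ceph-volume that amounts to giving
separate block.db and block.wal devices at OSD creation, e.g. (device paths
here are just placeholders):

    ceph-volume lvm create --bluestore --data /dev/sdb \
        --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2

If the DB and WAL share the same NVMe device, pointing only --block.db at it
is enough, since the WAL is then colocated with the DB.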

Regards,
Alex

>>
>>
>>
>> Thanks,
>>
>> Nick
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



