Re: Bluestore on HDD+SSD sync write latency experiences

-----Original Message-----
From: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> 
Sent: 02 May 2018 22:05
To: Nick Fisk <nick@xxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: Re:  Bluestore on HDD+SSD sync write latency experiences

Hi Nick,

On Tue, May 1, 2018 at 4:50 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Hi all,
>
>
>
> Slowly getting round to migrating clusters to Bluestore, but I am 
> interested in how people are handling the potential change in write 
> latency when coming from Filestore. Or maybe nobody is really seeing much difference?
>
>
>
> As we all know, in Bluestore writes are not double-written and in 
> most cases go straight to disk. Whilst this is awesome for people with 
> pure SSD or pure HDD clusters, as the amount of overhead is drastically 
> reduced, for people with HDDs + SSD journals in Filestore land the 
> double write had the side effect of acting like a battery-backed 
> cache, accelerating writes when not under saturation.
>
>
>
> In some brief testing I am seeing Filestore OSDs with NVMe journals 
> show an average apply latency of around 1-2ms, whereas some new 
> Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly 
> certain this is due to writes exhibiting the latency of the underlying 
> 7.2k disk. Note: the cluster is very lightly loaded; nothing is being driven into saturation.
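>
> For anyone wanting to check this on their own cluster, something like the
> below should show the relevant per-OSD latencies (osd.12 is just a
> placeholder ID, and the exact counter names vary a bit between releases):
>
>   # cluster-wide view of per-OSD commit/apply latency
>   ceph osd perf
>
>   # drill into a single OSD's internal latency counters via its admin socket
>   ceph daemon osd.12 perf dump | grep -i lat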
>
>
>
> I know there is a deferred write tuning knob which adjusts the cutover 
> for when an object is double-written, but at the default of 32KB, I 
> suspect a lot of IOs even in the 1MB area are still drastically 
> slower going straight to disk than if double-written to NVMe first. Has 
> anybody else done any investigation in this area? Is there any long-term 
> harm in running a cluster that defers writes up to 1MB+ in size to 
> mimic the Filestore double write approach?
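>
> For reference, the knob I mean is bluestore_prefer_deferred_size (with
> _hdd/_ssd variants). As a rough sketch of what I'm considering, raising
> the HDD cutover to 1MB would look something like the below (values are
> just an example, and whether it fully applies at runtime may vary by release):
>
>   # ceph.conf, [osd] section: defer (double write) anything up to 1MB on HDD OSDs
>   [osd]
>   bluestore_prefer_deferred_size_hdd = 1048576
>
>   # or inject at runtime for a quick test
>   ceph tell osd.* injectargs '--bluestore_prefer_deferred_size_hdd=1048576'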
>
>
>
> I also suspect, after looking through GitHub, that deferred writes only 
> happen when overwriting an existing object or blob (not sure which 
> case applies), so new allocations are still written straight to disk. Can anyone confirm?
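>
> One way I'm thinking of testing the overwrite-vs-new-allocation question
> from the client side (paths below are just placeholders) is to run the same
> sync write job twice against an RBD-backed filesystem, with a block size
> below the deferred cutover, and compare completion latencies: the first pass
> hits fresh allocations, the second pass overwrites the same extents:
>
>   fio --name=fresh --filename=/mnt/rbd/testfile --rw=write --bs=16k \
>       --size=1G --direct=1 --ioengine=libaio --iodepth=1 --fsync=1
>
>   fio --name=overwrite --filename=/mnt/rbd/testfile --rw=write --bs=16k \
>       --size=1G --direct=1 --ioengine=libaio --iodepth=1 --fsync=1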
>
>
>
> PS. If your spinning disks are connected via a RAID controller with 
> BBWC then you are not affected by this.

We saw this behavior even on an Areca 1883, which does buffer HDD writes.
The way out was to put the WAL and DB on NVMe drives, and that solved the performance problems.
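
For example, placing them there at OSD creation time can be done along
these lines with ceph-volume (device names are placeholders for the HDD
and the NVMe partitions):

  # data on the HDD, RocksDB and WAL on NVMe partitions
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2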

Just to confirm, our problem is not poor RocksDB performance when running on HDD, but rather the direct write of data to disk. Or have I misunderstood your comment?

--
Alex Gorbachev
Storcium

>
>
>
> Thanks,
>
> Nick
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



