On Thu, May 3, 2018 at 6:54 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> -----Original Message-----
> From: Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
> Sent: 02 May 2018 22:05
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: Bluestore on HDD+SSD sync write latency experiences
>
> Hi Nick,
>
> On Tue, May 1, 2018 at 4:50 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> Hi all,
>>
>> Slowly getting round to migrating clusters to Bluestore, but I am
>> interested in how people are handling the potential change in write
>> latency coming from Filestore. Or maybe nobody is really seeing much
>> difference?
>>
>> As we all know, in Bluestore writes are not double-written and in most
>> cases go straight to disk. Whilst this is awesome for people with pure
>> SSD or pure HDD clusters, as the amount of overhead is drastically
>> reduced, for people with HDD+SSD journals in Filestore land the double
>> write had the side effect of acting like a battery-backed cache,
>> accelerating writes when not under saturation.
>>
>> In some brief testing I am seeing Filestore OSDs with an NVMe journal
>> show an average apply latency of around 1-2ms, whereas some new
>> Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly
>> certain this is due to writes exhibiting the latency of the underlying
>> 7.2k disk. Note, the cluster is very lightly loaded; nothing is being
>> driven into saturation.
>>
>> I know there is a deferred-write tuning knob which adjusts the cutover
>> for when an object is double-written, but at the default of 32kB I
>> suspect a lot of I/Os even in the 1MB range are still drastically
>> slower going straight to disk than if double-written to NVMe first.
>> Has anybody else done any investigation in this area? Is there any
>> long-term harm in running a cluster that defers writes up to 1MB+ in
>> size to mimic the Filestore double-write approach?
>>
>> I also suspect, after looking through GitHub, that deferred writes
>> only happen when overwriting an existing object or blob (not sure
>> which case applies), so new allocations are still written straight to
>> disk. Can anyone confirm?
>>
>> PS. If your spinning disks are connected via a RAID controller with
>> BBWC then you are not affected by this.
>
> We saw this behavior even on an Areca 1883, which does buffer HDD
> writes. The way out was to put the WAL and DB on NVMe drives, and that
> solved the performance problems.
>
> Just to confirm, our problem is not poor performance of RocksDB when
> running on HDD, but the direct write of data to disk. Or have I
> misunderstood your comment?

Correct, the write latencies were quite high; then we moved the WAL and
DB to NVMe PCIe devices and the latencies greatly improved. Almost like
Filestore journal behavior.

Regards,
Alex

>> Thanks,
>>
>> Nick
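
For anyone who wants to experiment with the cutover Nick mentions: the
knob is bluestore_prefer_deferred_size_hdd (32768 bytes by default in
the Luminous era); writes below this size are staged through the WAL on
the DB device and flushed to the data disk afterwards. A minimal sketch
of raising it to 1MB, assuming rotational data devices and that the
affected OSDs are restarted for the change to take effect:

    # ceph.conf on the OSD hosts
    [osd]
    # Defer writes up to 1 MiB through the WAL (default is 32 KiB), so
    # small and medium writes ack from the NVMe rather than the 7.2k
    # spinner, similar to the old Filestore journal behaviour.
    bluestore_prefer_deferred_size_hdd = 1048576

Per-OSD commit/apply latency can then be compared before and after with
"ceph osd perf". This is a sketch rather than a recommendation: the
larger the deferred size, the more data is funnelled through the WAL/DB
device, so its bandwidth and endurance become the limiting factor.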
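
The WAL/DB-on-NVMe layout Alex describes is chosen at OSD creation
time; a sketch with ceph-volume, using hypothetical device paths
(/dev/sdb for the spinner, two NVMe partitions for DB and WAL):

    # Data on the HDD, RocksDB and WAL on NVMe partitions. If
    # --block.wal is omitted, the WAL simply lives inside the DB device.
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1 \
        --block.wal /dev/nvme0n1p2

Keep in mind that only deferred (small) writes get the journal-like
acceleration from the NVMe WAL; writes above the deferred-size
threshold still go straight to the HDD, which is the latency Nick is
measuring.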