RE: Bluestore deferred writes for new objects

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: 11 October 2018 14:25
> To: nick@xxxxxxxxxx; 'Sage Weil' <sage@xxxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Bluestore deferred writes for new objects
> 
> On 10/11/2018 07:14 AM, Nick Fisk wrote:
> 
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> >> Sent: 10 October 2018 23:37
> >> To: Nick Fisk <nick@xxxxxxxxxx>
> >> Cc: ceph-devel@xxxxxxxxxxxxxxx
> >> Subject: Re: Bluestore deferred writes for new objects
> >>
> >> On Wed, 10 Oct 2018, Nick Fisk wrote:
> >>> Following up from a discussion on the performance call last week.
> >>>
> >>> Is anybody able to confirm the behaviour of Bluestore deferred writes with new objects?
> >>>
> >>>  From my testing it appears that new object are always directly
> >>> written to the underlying block device and not buffered into flash, whereas existing objects <64KB are.
> >> I just did a quick test (on master) and it looks like the deferred
> >> writes are working as expected in that they *do* apply to new objects
> >> (as well as existing ones).  Can you be a bit more specific about
> >> what you observed?  (which version?  what workload?)
> >>
> > This is on Mimic 13.2.2, 7.2k disks with SSD for DB. Three observations I have made.
> >
> > 1. RADOS bench doing QD=1 4k objects is a lot slower than writing with
> >    FIO (directio) QD=1 4kb IO's to a fully thickened RBD (about 10x)
> > 2. RADOS bench seems to increment the bluestore_write_small_new counter,
> >    whereas the fio test increments bluestore_write_small_deferred,
> >    although deferred_write_ops looks like it increases in both cases
> > 3. Compared to Filestore the RADOS bench test is also slower (again
> >    about 10x)
> 
> I've noticed slower write behavior when creating objects (RBD prefill and rados) than overwrites to existing RBD objects.  It wasn't
> anywhere near 10x, but I wasn't focusing on the QD=1 use case either.  I've got a pile of things I need to work on, but I don't want to
> lose this one because this is an important case to track down.  I want to try to replicate it in-house next week.
> 
> Nick, can you send me the rados bench and fio cmdline you are using?  I imagine so long as it's low QD and object creates vs RBD
> overwrites it should be pretty obvious, but the exact invocations wouldn't hurt to have.


I've confirmed I'm seeing the two deferred_write log entries when doing the rados bench test. I will continue to plough through the debug logs to see if I can spot any differences between the two IO tests.
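
For reference, I'm comparing the counters between runs with something along these lines (osd.0 is just a placeholder for whichever OSD the test hits):

  # dump the BlueStore write-path counters before and after each run
  ceph daemon osd.0 perf dump | \
      grep -E 'write_small_new|write_small_deferred|deferred_write_ops'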

I've been doing some more tests with fio and have started to get my head around this a bit further. If I run a "randwrite" QD=1 4k job with fio instead of just "write", I see similar results to rados bench. This suggests that the behaviour is not related to deferred writes for new objects as I first suspected, but is instead down to either the difference between deferring random vs sequential IO, or the difference between writing to lots of different objects vs doing lots of 4kB IOs to a small number of objects.
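
To be clear about the two fio cases and the rados bench case, the jobs are along these lines (pool, image name and runtime are illustrative rather than my exact invocations):

  # 4k sequential writes at QD=1 to a fully provisioned RBD image
  fio --name=rbd_iodepth1 --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=testimg --rw=write --bs=4k --iodepth=1 --runtime=60 --time_based

  # same job, but random writes
  fio --name=rbd_iodepth1 --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=testimg --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based

  # rados bench creating lots of new 4k objects at QD=1
  rados -p rbd bench 60 write -b 4096 -t 1 --no-cleanup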

My assumption is that an HDD+SSD OSD should be able to match the write latency of an SSD OSD, as long as the underlying HDD never starts to saturate and cause the deferred write queue to back up. Is that right? In these tests I never see the HDD getting above ~10% utilisation.
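
(For what it's worth, I'm just eyeballing per-device utilisation during the runs with iostat; sda/sdb below are placeholders for the HDD and the DB SSD.)

  # watch extended per-device stats once a second while the test runs
  iostat -x 1 sda sdb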

Hopefully you can see the potential for gains in the tests below; all pools are 3x replicated.

Fio - randwrite(HDD+SSD)
rbd_iodepth1: (groupid=0, jobs=1): err= 0: pid=2748105: Thu Oct 11 21:02:44 2018
  write: io=5688.0KB, bw=795155B/s, iops=194, runt=  7325msec
    slat (usec): min=7, max=77, avg=13.94, stdev= 5.08
    clat (usec): min=874, max=255057, avg=5134.22, stdev=22184.62
     lat (usec): min=886, max=255083, avg=5148.16, stdev=22184.62
    clat percentiles (usec):
     |  1.00th=[  916],  5.00th=[  972], 10.00th=[ 1012], 20.00th=[ 1064],
     | 30.00th=[ 1096], 40.00th=[ 1128], 50.00th=[ 1160], 60.00th=[ 1208],
     | 70.00th=[ 1256], 80.00th=[ 1336], 90.00th=[ 2928], 95.00th=[13120],
     | 99.00th=[146432], 99.50th=[195584], 99.90th=[230400], 99.95th=[254976],
     | 99.99th=[254976]

Fio - write(HDD+SSD)
rbd_iodepth1: (groupid=0, jobs=1): err= 0: pid=2749067: Thu Oct 11 21:04:01 2018
  write: io=21628KB, bw=2049.7KB/s, iops=512, runt= 10552msec
    slat (usec): min=8, max=100, avg=14.79, stdev= 5.65
    clat (usec): min=724, max=72708, avg=1934.28, stdev=4339.77
     lat (usec): min=735, max=72726, avg=1949.07, stdev=4339.97
    clat percentiles (usec):
     |  1.00th=[  756],  5.00th=[  780], 10.00th=[  804], 20.00th=[  844],
     | 30.00th=[  884], 40.00th=[  916], 50.00th=[  956], 60.00th=[  980],
     | 70.00th=[ 1012], 80.00th=[ 1064], 90.00th=[ 1272], 95.00th=[ 8256],
     | 99.00th=[23168], 99.50th=[30592], 99.90th=[45312], 99.95th=[61184],
     | 99.99th=[73216]

Fio - randwrite or write(SSD)
rbd_iodepth1: (groupid=0, jobs=1): err= 0: pid=2749597: Thu Oct 11 21:05:18 2018
  write: io=22552KB, bw=3495.1KB/s, iops=873, runt=  6451msec
    slat (usec): min=8, max=69, avg=14.09, stdev= 5.04
    clat (usec): min=771, max=46175, avg=1127.39, stdev=839.94
     lat (usec): min=783, max=46188, avg=1141.47, stdev=840.06
    clat percentiles (usec):
     |  1.00th=[  836],  5.00th=[  868], 10.00th=[  900], 20.00th=[  948],
     | 30.00th=[  996], 40.00th=[ 1032], 50.00th=[ 1064], 60.00th=[ 1096],
     | 70.00th=[ 1128], 80.00th=[ 1208], 90.00th=[ 1352], 95.00th=[ 1512],
     | 99.00th=[ 1912], 99.50th=[ 2160], 99.90th=[10688], 99.95th=[20352],
     | 99.99th=[46336]

For reference, my cluster with very fast CPUs + Filestore (currently) can push ~1600 IOPS as long as the object files are in the slab cache. An iSCSI storage array will probably push just over 3000 IOPS.


> 
> Mark



