On Wed, Mar 21, 2018 at 6:50 PM, Frederic BRET <frederic.bret@xxxxxxxxxx> wrote:
> Hi all,
>
> The context :
> - Test cluster aside the production one
> - Fresh install on Luminous
> - Choice of Bluestore (coming from Filestore)
> - Default config (including wpq queuing)
> - 6 nodes SAS12, 14 OSD, 2 SSD, 2 x 10Gb per node, far more Gb at each switch uplink...
> - R3 pool, 2 nodes per site
> - Separate db (25GB) and wal (600MB) partitions on SSD for each OSD, to be able to observe each kind of IO with iostat
> - RBD client fio --ioengine=libaio --iodepth=128 --direct=1
> - Client RBD : rbd map rbd/test_rbd -o queue_depth=1024
> - Just to point out, this is not a thread on SSD performance or on the ratio between SSDs and number of OSDs. These 12Gb SAS 10DWPD SSDs perform perfectly, with lots of headroom, on the production cluster even with XFS filestore and journals on SSDs.
> - This thread is about a possible bottleneck on small block sizes with rocksdb/wal/Bluestore.
>
> To begin with, Bluestore performance is really breathtaking compared to filestore/XFS: we saturate the 20Gb client bandwidth on this small test cluster as soon as IO blocksize=64k, something we couldn't achieve with Filestore and journals, even at 256k.
>
> The downside: all small IO blocksizes (4k, 8k, 16k, 32k) are considerably slower and appear somewhat capped.
>
> Just to compare, here are the observed latencies at 2 consecutive values, blocksize 64k and 32k :
> 64k :
> write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
> lat (msec): min=2, max=867, avg=17.29, stdev=32.31
>
> 32k :
> write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
> lat (msec): min=1, max=5111, avg=78.81, stdev=430.50
>
> Whereas the 64k run almost fills the 20Gb client connection, the 32k one gets a mere 1/10th of the bandwidth, and IO latencies are multiplied by 4.5 (or get a ~60ms pause ?...)
>
> And we see the same constant latency at 16k, 8k and 4k :
> 16k :
> write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
> lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08
>
> 8k :
> write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
> lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61
>
> 4k :
> write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
> lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
>
> To compare with Filestore: on the 4k IO results I have on hand from the previous install, we were getting almost 2x the Bluestore performance on the exact same cluster :
> WRITE: io=1221.4MB, aggrb=41477KB/s, maxt=30152msec
>
> The thing is, during these small-blocksize fio benchmarks, node CPU, OSDs, SSDs, and of course the network are nowhere near saturated (i.e. I think this has nothing to do with write amplification), yet client IOPS starve at low values.
> Shouldn't Bluestore IOPS be far higher than Filestore on small IOs too ?
>
> To summarize, here is what we can observe :
>
> Looking for counters, I found incrementing values in "perf dump" during the slow IO benchmarks, here for 1 run of 4k fio :
> "deferred_write_ops": 7631,
> "deferred_write_bytes": 31457280,

Bluestore data-journals any write smaller than min_alloc_size, because such a write has to happen in place, whereas writes equal to or larger than that go directly to their final location on disk. IOW, anything smaller than min_alloc_size is written twice. The default min_alloc_size is 64k. That is what those counters refer to.
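For what it's worth, one way to confirm this on a test node is to compare the deferred-write counters with the configured allocation unit via the OSD admin socket ("osd.0" below is just a placeholder for one of your OSDs):

    # configured allocation unit (hdd/ssd variants); note this only
    # takes effect at OSD mkfs time, so changing it later means
    # re-provisioning the OSD
    ceph daemon osd.0 config show | grep bluestore_min_alloc_size

    # deferred-write counters, before and after a small-block fio run
    ceph daemon osd.0 perf dump | grep deferred_write

If deferred_write_ops grows roughly in step with the fio IOPS for 4k-32k runs and stays flat at 64k and above, that matches the explanation above.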
Thanks,

Ilya