On Wed, Mar 21, 2018 at 6:50 PM, Frederic BRET <frederic.bret@xxxxxxxxxx> wrote:
> Hi all,
>
> The context :
> - Test cluster aside the production one
> - Fresh install on Luminous
> - Choice of Bluestore (coming from Filestore)
> - Default config (including wpq queuing)
> - 6 nodes SAS12, 14 OSD, 2 SSD, 2 x 10Gb per node, far more Gb at each switch uplink...
> - R3 pool, 2 nodes per site
> - Separate db (25GB) and wal (600MB) partitions on SSD for each OSD, to be able to observe each kind of IO with iostat
> - RBD client fio --ioengine=libaio --iodepth=128 --direct=1
> - Client RBD : rbd map rbd/test_rbd -o queue_depth=1024
> - Just to point out, this is not a thread on SSD performance or on the ratio between SSDs and number of OSDs. These 12Gb SAS 10DWPD SSDs perform perfectly, with lots of headroom, on the production cluster even with XFS filestore and journals on SSDs.
> - This thread is about a possible bottleneck on small block sizes with rocksdb/wal/Bluestore.
>
> To begin with, Bluestore performance is really breathtaking compared to filestore/XFS: we saturate the 20Gb client bandwidth on this small test cluster as soon as IO blocksize=64k, something we couldn't achieve with Filestore and journals, even at 256k.
>
> The downside: all small IO blocksizes (4k, 8k, 16k, 32k) are considerably slower and appear somewhat capped.
>
> Just to compare, here are the observed latencies at 2 consecutive values, blocksize 64k and 32k :
> 64k :
> write: io=55563MB, bw=1849.2MB/s, iops=29586, runt= 30048msec
> lat (msec): min=2, max=867, avg=17.29, stdev=32.31
>
> 32k :
> write: io=6332.2MB, bw=207632KB/s, iops=6488, runt= 31229msec
> lat (msec): min=1, max=5111, avg=78.81, stdev=430.50
>
> Whereas the 64k run almost fills the 20Gb client connection, the 32k one gets a mere 1/10th of the bandwidth, and IO latencies are multiplied by 4.5 (or get a ~60ms pause ?...)
>
> And we see the same constant latency at 16k, 8k and 4k :
> 16k :
> write: io=3129.4MB, bw=102511KB/s, iops=6406, runt= 31260msec
> lat (msec): min=0.908, max=6.67, avg=79.87, stdev=500.08
>
> 8k :
> write: io=1592.8MB, bw=52604KB/s, iops=6575, runt= 31005msec
> lat (msec): min=0.824, max=5.49, avg=77.82, stdev=461.61
>
> 4k :
> write: io=837892KB, bw=26787KB/s, iops=6696, runt= 31280msec
> lat (msec): min=0.766, max=5.45, avg=76.39, stdev=428.29
>
> To compare with Filestore: on the 4k IO results I have on hand from the previous install, we were getting almost 2x the Bluestore performance on the exact same cluster :
> WRITE: io=1221.4MB, aggrb=41477KB/s, maxt=30152msec
>
> The thing is, during these small-blocksize fio benchmarks, node CPU, OSDs, SSDs, and of course the network are nowhere near saturated (i.e. I think this has nothing to do with write amplification), yet client IOPS starve at low values.
> Shouldn't Bluestore IOPS be far higher than Filestore on small IOs too ?
>
> To summarize, here is what we can observe :
>
> Looking for counters, I found incrementing values in "perf dump" during the slow IO benchmarks, here for 1 run of 4k fio :
> "deferred_write_ops": 7631,
> "deferred_write_bytes": 31457280,

Bluestore data-journals any write smaller than min_alloc_size, because such a write has to happen in place, whereas writes equal to or larger than that go directly to their final location on disk. IOW, anything smaller than min_alloc_size is written twice. The default min_alloc_size is 64k. That is what those counters refer to.
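For what it's worth, one way to confirm this on a test node is to compare the deferred-write counters with the configured allocation unit via the OSD admin socket ("osd.0" below is just a placeholder for one of your OSDs):

    # configured allocation unit (hdd/ssd variants); note this only
    # takes effect at OSD mkfs time, so changing it later means
    # re-provisioning the OSD
    ceph daemon osd.0 config show | grep bluestore_min_alloc_size

    # deferred-write counters, before and after a small-block fio run
    ceph daemon osd.0 perf dump | grep deferred_write

If deferred_write_ops grows roughly in step with the fio IOPS for 4k-32k runs and stays flat at 64k and above, that matches the explanation above.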
Thanks,

Ilya