Hi Vitaliy,

Yes - I tried this, and I can still see a significant number of reads (~110 IOPS, 440 KB/s) on the SSD. That is much better, but the result is still puzzling - I'm trying to understand what is causing the reads. The problem is amplified with numjobs >= 2, but it looks like it is still there with just 1. It is as if some caching parameter is set incorrectly and the same blocks are being read over and over while doing a write.

Could anyone advise on the best way for me to investigate further? I've tried strace (with -k) and 'perf record', but neither produces any useful stack traces to help understand what's going on.
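In case it helps, here is roughly what I plan to try next - a sketch, assuming blktrace and the bcc tools are installed (on Debian/Ubuntu the bcc tool may be packaged as biosnoop-bpfcc), with the device name taken from the iostat output quoted below. All of these need root:

# Per-I/O view: which process issues the reads, and at which sectors
biosnoop | grep nvme0n1

# Capture 30s of block-layer traffic, then check whether the same
# sectors are read repeatedly (with the default blkparse output format,
# field 6 is the action, field 7 the R/W flags, field 8 the sector)
blktrace -d /dev/nvme0n1 -o trace -w 30
blkparse -i trace | awk '$6 == "C" && $7 ~ /R/ {print $8}' | sort | uniq -c | sort -rn | head

# Kernel stacks at block request issue time, to see what submits the reads
perf record -e block:block_rq_issue -a -g -- sleep 10
perf script | less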
Regards,

Brad

On Tue, 4 Feb 2020 at 11:05, Vitaliy Filippov <vitalif@xxxxxxxxxx> wrote:
> Hi,
>
> Try to repeat your test with numjobs=1; I've already seen strange
> behaviour with parallel jobs to one RBD image.
>
> Also, as usual: https://yourcmc.ru/wiki/Ceph_performance :-)
>
> > Hi,
> >
> > We have a production cluster of 27 OSDs across 5 servers (all SSDs
> > running BlueStore), and have started to notice a possible performance
> > issue.
> >
> > In order to isolate the problem, we built a single server with a single
> > OSD and ran a few fio tests. The results are puzzling, not that we were
> > expecting good performance from a single OSD.
> >
> > In short, with a sequential write test, we are seeing huge numbers of
> > reads hitting the actual SSD.
> >
> > Key fio parameters are:
> >
> > [global]
> > pool=benchmarks
> > rbdname=disk-1
> > direct=1
> > numjobs=2
> > iodepth=1
> > blocksize=4k
> > group_reporting=1
> > [writer]
> > readwrite=write
> >
> > iostat results are:
> >
> > Device:  rrqm/s  wrqm/s     r/s    w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > nvme0n1    0.00  105.00 4896.00 294.00 312080.00 1696.00   120.92    17.25   3.35    3.55    0.02   0.02  12.60
> >
> > That is nearly 5,000 reads/second (~300 MB/s), compared with only ~300
> > writes/second (~1.7 MB/s), while we are running a sequential write test.
> > The system is otherwise idle, with no other workload.
> >
> > Running the same fio test with only one thread (numjobs=1) still shows a
> > high number of reads (~110/s):
> >
> > Device:  rrqm/s  wrqm/s     r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > nvme0n1    0.00 1281.00  110.00 1463.00 440.00 12624.00    16.61     0.03   0.02    0.05    0.02   0.02   3.40
> >
> > Can anyone kindly offer any comments on why we are seeing this behaviour?
> >
> > I can understand the occasional read here and there if RocksDB/WAL
> > entries need to be read from disk during the sequential write test, but
> > this seems significantly high and unusual.
> >
> > fio results (numjobs=2):
> >
> > writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=rbd, iodepth=1
> > ...
> > fio-3.7
> > Starting 2 processes
> > Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
> > writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
> >   write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
> >     slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
> >     clat (msec): min=2, max=210, avg=58.32, stdev=70.54
> >      lat (msec): min=2, max=210, avg=58.35, stdev=70.54
> >     clat percentiles (msec):
> >      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
> >      | 30.00th=[    3], 40.00th=[    3], 50.00th=[   54], 60.00th=[   62],
> >      | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
> >      | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
> >      | 99.99th=[  211]
> >    bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22, samples=239
> >    iops        : min=    6, max=   36, avg=16.97, stdev= 9.55, samples=239
> >   lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
> >   cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
> >   IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> >
> > Run status group 0 (all jobs):
> >   WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
>
> --
> With best regards,
>   Vitaliy Filippov
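P.S. One more check I plan to make: whether the reads are coming from RocksDB (via BlueFS) rather than from the object data path. A sketch, assuming the OSD admin socket is available; osd.0 is a placeholder for the real OSD id:

# BlueFS handles RocksDB's file I/O, so read counters that climb during
# a pure write workload would point at RocksDB (e.g. compaction) rather
# than at client reads
ceph daemon osd.0 perf dump bluefs
ceph daemon osd.0 perf dump bluestore

Sampling both before and after the fio run and diffing the output should show where the read bytes are being accounted.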