Hi,

We have a production cluster of 27 OSDs across 5 servers (all SSDs running BlueStore), and have started to notice a possible performance issue.

To isolate the problem, we built a single server with a single OSD and ran a few FIO tests. The results are puzzling, even allowing for the fact that we were not expecting good performance from a single OSD.

In short, with a sequential write test, we are seeing a huge number of reads hitting the actual SSD.

Key FIO parameters are:

    [global]
    pool=benchmarks
    rbdname=disk-1
    direct=1
    numjobs=2
    iodepth=1
    blocksize=4k
    group_reporting=1

    [writer]
    readwrite=write

iostat results are:

    Device:  rrqm/s  wrqm/s      r/s     w/s      rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    nvme0n1    0.00  105.00  4896.00  294.00  312080.00  1696.00    120.92     17.25   3.35     3.55     0.02   0.02  12.60

There are ~5000 reads/second (~300 MB/sec), compared with only ~300 writes/second (~1.7 MB/sec), even though this is a sequential write test. The system is otherwise idle, with no other workload.

Running the same fio test with only 1 thread (numjobs=1) still shows a high number of reads (110 per second):

    Device:  rrqm/s   wrqm/s     r/s      w/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    nvme0n1    0.00  1281.00  110.00  1463.00  440.00  12624.00     16.61      0.03   0.02     0.05     0.02   0.02   3.40

Can anyone kindly offer any comments on why we are seeing this behaviour? I could understand the occasional read if RocksDB/WAL entries need to be read from disk during the sequential write test, but this seems unusually high.

FIO results (numjobs=2):

    writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
    ...
    fio-3.7
    Starting 2 processes
    Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
    writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb 3 22:46:16 2020
      write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
        slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
        clat (msec): min=2, max=210, avg=58.32, stdev=70.54
         lat (msec): min=2, max=210, avg=58.35, stdev=70.54
        clat percentiles (msec):
         |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
         | 30.00th=[    3], 40.00th=[    3], 50.00th=[   54], 60.00th=[   62],
         | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
         | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
         | 99.99th=[  211]
       bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22, samples=239
       iops        : min=    6, max=   36, avg=16.97, stdev= 9.55, samples=239
      lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
      cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
      IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency   : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
      WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
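
To put the two iostat samples quoted in this post side by side, the average request sizes and read/write ratios can be derived from the r/s, w/s, rkB/s and wkB/s columns. The following is only a rough sketch of that arithmetic: the dictionary layout and the `summarize` helper are made up for illustration and are not part of iostat, fio, or Ceph. The numbers are copied from the two nvme0n1 rows.

```python
# Values copied from the two iostat rows in this post (device nvme0n1).
# The dict layout and the helper name are illustrative assumptions only.
SAMPLES = {
    "numjobs=2": {"r_s": 4896.0, "w_s": 294.0, "rkB_s": 312080.0, "wkB_s": 1696.0},
    "numjobs=1": {"r_s": 110.0, "w_s": 1463.0, "rkB_s": 440.0, "wkB_s": 12624.0},
}


def summarize(sample):
    """Derive average request sizes and read/write ratios from one iostat row."""
    return {
        # Average size of a single read / write request, in kB.
        "avg_read_kB": sample["rkB_s"] / sample["r_s"],
        "avg_write_kB": sample["wkB_s"] / sample["w_s"],
        # How many reads (or read bytes) the device sees per write (or written byte).
        "read_write_iops_ratio": sample["r_s"] / sample["w_s"],
        "read_write_bytes_ratio": sample["rkB_s"] / sample["wkB_s"],
    }


if __name__ == "__main__":
    for name, sample in SAMPLES.items():
        print(name)
        for key, value in summarize(sample).items():
            print(f"  {key}: {value:.2f}")
```

On these figures, the numjobs=2 run averages roughly 64 kB per read and reads about 184 times as many bytes as it writes, whereas the numjobs=1 run averages 4 kB per read and reads far less than it writes. This only restates what iostat reported; it does not by itself explain where the reads come from.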