Hi Vitaliy,

Yes - I tried this, and I can still see a significant number of reads (~110 IOPS, 440 KB/s) on the SSD. That is much better, but the result is still puzzling - I'm trying to understand what is causing the reads. The problem is amplified with numjobs >= 2, but it looks like it is still there with just 1. It is as if some caching parameter is set incorrectly and the same blocks are being read over and over while doing a write.

Could anyone advise on the best way for me to investigate further? I've tried strace (with -k) and 'perf record', but neither produces any useful stack traces to help understand what's going on.
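In case it helps, here is roughly what I plan to try next - a sketch, assuming blktrace and the bcc tools are installed (on Debian/Ubuntu the bcc tool may be packaged as biosnoop-bpfcc), with the device name taken from the iostat output quoted below. All of these need root:

# Per-I/O view: which process issues the reads, and at which sectors
biosnoop | grep nvme0n1

# Capture 30s of block-layer traffic, then check whether the same
# sectors are read repeatedly (with the default blkparse output format,
# field 6 is the action, field 7 the R/W flags, field 8 the sector)
blktrace -d /dev/nvme0n1 -o trace -w 30
blkparse -i trace | awk '$6 == "C" && $7 ~ /R/ {print $8}' | sort | uniq -c | sort -rn | head

# Kernel stacks at block request issue time, to see what submits the reads
perf record -e block:block_rq_issue -a -g -- sleep 10
perf script | less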
Regards,

Brad

On Tue, 4 Feb 2020 at 11:05, Vitaliy Filippov <vitalif@xxxxxxxxxx> wrote:
> Hi,
>
> Try to repeat your test with numjobs=1; I've already seen strange
> behaviour with parallel jobs to one RBD image.
>
> Also, as usual: https://yourcmc.ru/wiki/Ceph_performance :-)
>
> > Hi,
> >
> > We have a production cluster of 27 OSDs across 5 servers (all SSDs
> > running BlueStore), and have started to notice a possible performance
> > issue.
> >
> > In order to isolate the problem, we built a single server with a single
> > OSD and ran a few fio tests. The results are puzzling, not that we were
> > expecting good performance from a single OSD.
> >
> > In short, with a sequential write test, we are seeing huge numbers of
> > reads hitting the actual SSD.
> >
> > Key fio parameters are:
> >
> > [global]
> > pool=benchmarks
> > rbdname=disk-1
> > direct=1
> > numjobs=2
> > iodepth=1
> > blocksize=4k
> > group_reporting=1
> > [writer]
> > readwrite=write
> >
> > iostat results are:
> >
> > Device:  rrqm/s  wrqm/s     r/s    w/s     rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > nvme0n1    0.00  105.00 4896.00 294.00 312080.00 1696.00   120.92    17.25   3.35    3.55    0.02   0.02  12.60
> >
> > That is nearly 5,000 reads/second (~300 MB/s), compared with only ~300
> > writes/second (~1.7 MB/s), while we are running a sequential write test.
> > The system is otherwise idle, with no other workload.
> >
> > Running the same fio test with only one thread (numjobs=1) still shows a
> > high number of reads (~110/s):
> >
> > Device:  rrqm/s  wrqm/s     r/s     w/s  rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > nvme0n1    0.00 1281.00  110.00 1463.00 440.00 12624.00    16.61     0.03   0.02    0.05    0.02   0.02   3.40
> >
> > Can anyone kindly offer any comments on why we are seeing this behaviour?
> >
> > I can understand the occasional read here and there if RocksDB/WAL
> > entries need to be read from disk during the sequential write test, but
> > this seems significantly high and unusual.
> >
> > fio results (numjobs=2):
> >
> > writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
> > 4096B-4096B, ioengine=rbd, iodepth=1
> > ...
> > fio-3.7
> > Starting 2 processes
> > Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
> > writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
> >   write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
> >     slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
> >     clat (msec): min=2, max=210, avg=58.32, stdev=70.54
> >      lat (msec): min=2, max=210, avg=58.35, stdev=70.54
> >     clat percentiles (msec):
> >      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
> >      | 30.00th=[    3], 40.00th=[    3], 50.00th=[   54], 60.00th=[   62],
> >      | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
> >      | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
> >      | 99.99th=[  211]
> >    bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22, samples=239
> >    iops        : min=    6, max=   36, avg=16.97, stdev= 9.55, samples=239
> >   lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
> >   cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
> >   IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
> >      latency   : target=0, window=0, percentile=100.00%, depth=1
> >
> > Run status group 0 (all jobs):
> >   WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
>
> --
> With best regards,
>   Vitaliy Filippov
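P.S. One more check I plan to make: whether the reads are coming from RocksDB (via BlueFS) rather than from the object data path. A sketch, assuming the OSD admin socket is available; osd.0 is a placeholder for the real OSD id:

# BlueFS handles RocksDB's file I/O, so read counters that climb during
# a pure write workload would point at RocksDB (e.g. compaction) rather
# than at client reads
ceph daemon osd.0 perf dump bluefs
ceph daemon osd.0 perf dump bluestore

Sampling both before and after the fio run and diffing the output should show where the read bytes are being accounted.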