Re: Understanding Bluestore performance characteristics

Hi Igor,

This has been very helpful.

I have identified (with numjobs=1, the least-bad case) that
bluestore_write_small_pre_read is incrementing at roughly the same rate as
the sequential-write IOPS:

Tue  4 Feb 22:44:34 GMT 2020
        "bluestore_write_small_pre_read": 572818,
Tue  4 Feb 22:44:36 GMT 2020
        "bluestore_write_small_pre_read": 576640,
Tue  4 Feb 22:44:37 GMT 2020
        "bluestore_write_small_pre_read": 580501,

(Approximately 3,800 bluestore_write_small_pre_read increments per second.)

With fio reporting (1-minute average):

  write: IOPS=3292, BW=12.9MiB/s (13.5MB/s)(772MiB/60002msec)
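
(For reference, the counter samples above came from a crude polling loop along
these lines, assuming osd.0 and that 'perf dump' pretty-prints one counter per
line as it does on my box:)

  while true; do
      date
      ceph daemon osd.0 perf dump | grep '"bluestore_write_small_pre_read"'
      sleep 1
  done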

This is my first dive into the code, but it looks like the
"bluestore_write_small_pre_read" counter gets incremented when there is a
head-read or tail-read of the block being written.

I don't understand enough about BlueStore yet, but my thinking up to this
point was that most BlueStore writes would be aligned to the allocation-chunk
size, avoiding the need for head/tail reads. I've specifically tried to tune
bluestore_min_alloc_size with that in mind.
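
As a sanity check on my own mental model (which may well be wrong), this is
roughly the arithmetic I have in mind, plus how I check what the OSD is
configured with - the 16384 below is just an assumed allocation size for
illustration, not necessarily what my OSD was built with:

  # configured values (my understanding is that the effective min_alloc_size
  # is fixed when the OSD is created, so this only shows the config side)
  ceph daemon osd.0 config get bluestore_min_alloc_size
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd

  # toy example: a 4k write at offset 20480 into an assumed 16k alloc unit
  offset=20480; length=4096; alloc=16384
  head=$(( offset % alloc ))                              # uncovered bytes before the write
  tail=$(( (alloc - (offset + length) % alloc) % alloc )) # uncovered bytes after it
  echo "head=$head tail=$tail"    # prints head=4096 tail=8192, i.e. a partial chunk

My assumption was that with min_alloc_size matching the client block size,
head and tail would always be zero and no pre-read would be needed.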

Furthermore, I've noticed that the majority of these writes are actually
going to the BlueStore WAL - I'm also seeing a very high rate of
deferred_write_ops (lower than the bluestore_write_small_pre_read rate -
roughly 2,200 vs 3,800 per second).

I tried to tune out the deferred writes by setting
bluestore_prefer_deferred_size, but it had no effect. My guess is that the
deferred writes arise because the writes are somehow not aligned with the
originally allocated chunk sizes, and that head/tail
(bluestore_write_small_pre_read) writes are *always* issued as deferred
writes?
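
(In case I've simply tuned the wrong knob, this is how I've been checking the
deferred-write counters and settings - I'm not sure whether the plain option
or the _ssd/_hdd variant is the one that actually takes effect on this device:)

  ceph daemon osd.0 perf dump | grep deferred
  ceph daemon osd.0 config get bluestore_prefer_deferred_size
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_ssd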

This is the first time I'm dipping my toe into this, so I have a lot to
learn, but my obvious question at this point is: can BlueStore be tuned so
that all writes are 4k aligned, avoiding the head/tail reads that I'm seeing?
This is purely an RBD deployment (no RGW or CephFS), and all of the file
systems residing on the RBD volumes use 4k block sizes, so I'd expect all
writes to already be 4k aligned?
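
(For completeness, the image under test is the one from the fio job quoted
below - pool 'benchmarks', image 'disk-1'. If the object size or striping
layout is relevant here, I can share the output of:)

  rbd info benchmarks/disk-1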

Thanks for your help so far.

Regards
--
Brad.


On Tue, 4 Feb 2020 at 16:51, Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi Bradley,
>
> you might want to check performance counters for this specific OSD.
>
> Available via the 'ceph daemon osd.0 perf dump' command in Nautilus. A
> slightly different command in Luminous, AFAIR.
>
> Then look for the 'read' substring in the dump and try to spot any
> unexpectedly high read-related counter values.
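>
> E.g. something along these lines (osd.0 just as an example):
>
> ceph daemon osd.0 perf dump | grep -i read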
>
> And/or share it here for brief analysis.
>
>
> Thanks,
>
> Igor
>
>
>
> On 2/4/2020 7:36 PM, Bradley Kite wrote:
>
> Hi Vitaliy
>
> Yes - I tried this and I can still see a number of reads (~110 iops,
> 440KB/sec) on the SSD, so it is significantly better, but the result is
> still puzzling - I'm trying to understand what is causing the reads. The
> problem is amplified with numjobs >= 2 but it looks like it is still there
> with just 1.
>
> It is as though some caching parameter is not right and the same blocks are
> being read over and over again while writing?
>
> Could anyone advise on the best way for me to investigate further?
>
> I've tried strace (with -k) and 'perf record', but neither produces any
> useful stack traces to help me understand what's going on.
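>
> (Roughly what I ran, in case I'm holding these tools wrong - there is only
> the one ceph-osd process on this test box:)
>
> strace -f -k -p $(pidof ceph-osd) -o osd-strace.txt
> perf record -g -p $(pidof ceph-osd) -- sleep 30 && perf report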
>
> Regards
> --
> Brad
>
>
>
>
> On Tue, 4 Feb 2020 at 11:05, Vitaliy Filippov <vitalif@xxxxxxxxxx> wrote:
>
>
> Hi,
>
> Try to repeat your test with numjobs=1 - I've already seen strange
> behaviour with parallel jobs against one RBD image.
>
> Also as usual: https://yourcmc.ru/wiki/Ceph_performance :-)
>
>
> Hi,
>
> We have a production cluster of 27 OSDs across 5 servers (all SSDs running
> BlueStore), and have started to notice a possible performance issue.
>
> In order to isolate the problem, we built a single server with a single
> OSD, and ran a few fio tests. The results are puzzling, even though we were
> not expecting good performance from a single OSD.
>
> In short, with a sequential write test, we are seeing huge numbers of
> reads hitting the actual SSD.
>
> Key FIO parameters are:
>
> [global]
> pool=benchmarks
> rbdname=disk-1
> direct=1
> numjobs=2
> iodepth=1
> blocksize=4k
> group_reporting=1
> [writer]
> readwrite=write
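>
> (The job is run with the rbd ioengine - ioengine=rbd is set in the full job
> file, as the fio output below shows - and invoked as something like the
> following, where the file name is just a placeholder:)
>
> fio rbd-seq-write.fio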
>
> iostat results are:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00   105.00 4896.00  294.00 312080.00  1696.00   120.92    17.25    3.35    3.55    0.02   0.02  12.60
>
> There are nearly 5,000 reads/second (~300 MB/sec), compared with only ~300
> writes/second (~1.5 MB/sec), when we are doing a sequential write test? The
> system is otherwise idle, with no other workload.
>
> Running the same fio test with only 1 thread (numjobs=1) still shows a high
> number of reads (110).
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme0n1           0.00  1281.00  110.00 1463.00   440.00 12624.00    16.61     0.03    0.02    0.05    0.02   0.02   3.40
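>
> (If my arithmetic is right, that works out at roughly 64 kB per read on
> average with numjobs=2 (312080 kB/s over 4896 r/s), versus 4 kB per read
> with numjobs=1 (440 kB/s over 110 r/s).)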
>
> Can anyone kindly offer any comments on why we are seeing this behaviour?
>
> I could understand the occasional read here and there if RocksDB/WAL
> entries needed to be read from disk during the sequential write test, but
> the volume seems far too high for that.
>
> FIO results (numjobs=2)
> writer: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=1
> ...
> fio-3.7
> Starting 2 processes
> Jobs: 1 (f=1): [W(1),_(1)][52.4%][r=0KiB/s,w=208KiB/s][r=0,w=52 IOPS][eta 01m:00s]
> writer: (groupid=0, jobs=2): err= 0: pid=19553: Mon Feb  3 22:46:16 2020
>   write: IOPS=34, BW=137KiB/s (140kB/s)(8228KiB/60038msec)
>     slat (nsec): min=5402, max=77083, avg=27305.33, stdev=7786.83
>     clat (msec): min=2, max=210, avg=58.32, stdev=70.54
>      lat (msec): min=2, max=210, avg=58.35, stdev=70.54
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[   54], 60.00th=[   62],
>      | 70.00th=[   65], 80.00th=[  174], 90.00th=[  188], 95.00th=[  194],
>      | 99.00th=[  201], 99.50th=[  203], 99.90th=[  209], 99.95th=[  209],
>      | 99.99th=[  211]
>    bw (  KiB/s): min=   24, max=  144, per=49.69%, avg=68.08, stdev=38.22, samples=239
>    iops        : min=    6, max=   36, avg=16.97, stdev= 9.55, samples=239
>   lat (msec)   : 4=49.83%, 10=0.10%, 100=29.90%, 250=20.18%
>   cpu          : usr=0.08%, sys=0.08%, ctx=2100, majf=0, minf=118
>   IO depths    : 1=105.3%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>
>      issued rwts: total=0,2057,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=137KiB/s (140kB/s), 137KiB/s-137KiB/s (140kB/s-140kB/s), io=8228KiB (8425kB), run=60038-60038msec
>
> --
> With best regards,
>    Vitaliy Filippov
>
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


