On 12/07/16 13:52, Christian Balzer wrote:
> On Wed, 7 Dec 2016 12:39:11 +0100 Christian Theune wrote:
>
> | cartman06 ~ # fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=128k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
> | journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> | fio-2.0.14
> | Starting 1 process
> | Jobs: 1 (f=1): [W] [100.0% done] [0K/88852K/0K /s] [0 /22.3K/0 iops] [eta 00m:00s]
> | journal-test: (groupid=0, jobs=1): err= 0: pid=28606: Wed Dec 7 11:59:36 2016
> |   write: io=5186.7MB, bw=88517KB/s, iops=22129 , runt= 60001msec
> |     clat (usec): min=37 , max=1519 , avg=43.77, stdev=10.89
> |      lat (usec): min=37 , max=1519 , avg=43.94, stdev=10.90
> |     clat percentiles (usec):
> |      |  1.00th=[   39],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
> |      | 30.00th=[   41], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
> |      | 70.00th=[   43], 80.00th=[   44], 90.00th=[   47], 95.00th=[   53],
> |      | 99.00th=[   71], 99.50th=[   87], 99.90th=[  157], 99.95th=[  201],
> |      | 99.99th=[  478]
> |     bw (KB/s)  : min=81096, max=91312, per=100.00%, avg=88519.19, stdev=1762.43
> |     lat (usec) : 50=92.42%, 100=7.28%, 250=0.27%, 500=0.02%, 750=0.01%
> |     lat (usec) : 1000=0.01%
> |     lat (msec) : 2=0.01%
> |   cpu          : usr=5.43%, sys=14.64%, ctx=1327888, majf=0, minf=6
> |   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> |      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      issued    : total=r=0/w=1327777/d=0, short=r=0/w=0/d=0
> |
> | Run status group 0 (all jobs):
> |   WRITE: io=5186.7MB, aggrb=88516KB/s, minb=88516KB/s, maxb=88516KB/s, mint=60001msec, maxt=60001msec
> |
> | Disk stats (read/write):
> |   sdl: ios=15/1326283, merge=0/0, ticks=1/47203, in_queue=46970, util=78.29%
>
> That doesn’t look too bad to me, specifically the 99.99th of 478 microseconds seems fine.
>
> The iostat during this run looks OK as well:
>
> Both do look pretty bad to me.
>
> Your SSD with a nominal write speed of 850MB/s is doing 88MB/s at 80%
> utilization.
> The puny 400GB DC S3610 in my example earlier can do 400MB/s per Intel
> specs and was at 70% with 300MB/s (so half of it journal writes!).
> My experience with Intel SSDs (as mentioned before in this ML) is that
> their stated speeds can actually be achieved within about a 10% margin
> when used with Ceph, be it for pure journaling or as OSDs with inline
> journals.

I don't see how this makes any sense. Could you correct or explain it so it does?

- 300MB/s at 4k is about 77k IOPS, but the Intel 400GB DC S3610 spec[1] says it does 25k, so I think you should be more specific about how you tested it.
- His SSD is rated at 15k random write IOPS[2], so it is already exceeding that by a fair margin (both iostat and fio report around 22k), though the datasheet doesn't list a sequential rating.
- His command says bs=128k, but the output says 4k, so either that command didn't really produce that result, or something is bugged. (Is this where the confusion lies?)
- Also note he didn't set --ioengine=..., so depending on how the default changes per version, others could be comparing psync or another engine against his ioengine=sync; the engine should be stated explicitly when comparing results. A fully explicit version of the command is sketched below.
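
For comparison, here is a minimal sketch of how I would pin the test down so results are comparable across fio versions and machines. It assumes the same device (/dev/sdl), that the 4k block size shown in the output is what was actually intended, and an arbitrary job name; like the original run, it writes directly to the raw device, so it is destructive:

fio --filename=/dev/sdl --ioengine=sync --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test-4k

Dropping --direct=1 and --sync=1 from the same command line would give the buffered baseline that Christian suggests further down in the quote; if that run gets much closer to the drive's rated speed, the slow part is the DSYNC path.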
[1] http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3610-series.html
[2] https://www.micron.com/~/media/documents/products/data-sheet/ssd/s600dc_series_2_5_sas_ssd.pdf

>
> Your numbers here prove basically that you can't expect much more than
> 100MB/s (50MB/s effective) from these SSDs at the best of circumstances.
>
> How do these numbers change when you do "normal" FS level writes w/o
> sync/direct in fio, so the only SYNCs coming from the FS for its journals?
> Or just direct to the device w/o FS, an even more "friendly" test.
> Large/long runs so you can observe things when the pagecache gets flushed.
>
> If this gets significantly closer to the limits of your SSD, or in your case
> that of your 6Gb/s SAS2 link (so 600MB/s), then the proof is complete that
> the culprit is bad DSYNC handling of those SSDs.
>
> [...]
>
> |
> | Device: rrqm/s wrqm/s    r/s      w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl       0.00   0.00   0.00 22182.60   0.00 88730.40     8.00     0.78  0.04    0.00    0.04  0.04 77.66
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.21  0.04    2.51    3.40   0.00 89.83
> |
> | Device: rrqm/s wrqm/s    r/s      w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl       0.00   0.00   0.00 22392.00   0.00 89568.00     8.00     0.79  0.04    0.00    0.04  0.04 79.26
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.94  0.04    3.35    3.40   0.00 88.26
> [...]

--
--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
Internet: http://www.brockmann-consult.de
--------------------------------------------

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com