On 12/07/16 13:52, Christian Balzer wrote:
> On Wed, 7 Dec 2016 12:39:11 +0100 Christian Theune wrote:
>
> | cartman06 ~ # fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=128k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
> | journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> | fio-2.0.14
> | Starting 1 process
> | Jobs: 1 (f=1): [W] [100.0% done] [0K/88852K/0K /s] [0 /22.3K/0 iops] [eta 00m:00s]
> | journal-test: (groupid=0, jobs=1): err= 0: pid=28606: Wed Dec 7 11:59:36 2016
> |   write: io=5186.7MB, bw=88517KB/s, iops=22129 , runt= 60001msec
> |     clat (usec): min=37 , max=1519 , avg=43.77, stdev=10.89
> |      lat (usec): min=37 , max=1519 , avg=43.94, stdev=10.90
> |     clat percentiles (usec):
> |      |  1.00th=[   39],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
> |      | 30.00th=[   41], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
> |      | 70.00th=[   43], 80.00th=[   44], 90.00th=[   47], 95.00th=[   53],
> |      | 99.00th=[   71], 99.50th=[   87], 99.90th=[  157], 99.95th=[  201],
> |      | 99.99th=[  478]
> |     bw (KB/s)  : min=81096, max=91312, per=100.00%, avg=88519.19, stdev=1762.43
> |     lat (usec) : 50=92.42%, 100=7.28%, 250=0.27%, 500=0.02%, 750=0.01%
> |     lat (usec) : 1000=0.01%
> |     lat (msec) : 2=0.01%
> |   cpu          : usr=5.43%, sys=14.64%, ctx=1327888, majf=0, minf=6
> |   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> |      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      issued    : total=r=0/w=1327777/d=0, short=r=0/w=0/d=0
> |
> | Run status group 0 (all jobs):
> |   WRITE: io=5186.7MB, aggrb=88516KB/s, minb=88516KB/s, maxb=88516KB/s, mint=60001msec, maxt=60001msec
> |
> | Disk stats (read/write):
> |   sdl: ios=15/1326283, merge=0/0, ticks=1/47203, in_queue=46970, util=78.29%
>
> That doesn’t look too bad to me, specifically the 99.99th of 478 microseconds seems fine.
>
> The iostat during this run looks OK as well:
>
> Both do look pretty bad to me.
>
> Your SSD with a nominal write speed of 850MB/s is doing 88MB/s at 80%
> utilization.
> The puny 400GB DC S3610 in my example earlier can do 400MB/s per Intel
> specs and was at 70% with 300MB/s (so half of it journal writes!).
> My experience with Intel SSDs (as mentioned before in this ML) is that
> their stated speeds can actually be achieved within about a 10% margin
> when used with Ceph, be it for pure journaling or as OSDs with inline
> journals.

I don't see how this makes any sense. Could you correct or explain it so it does?

- 300MB/s at 4k is about 77k IOPS, but the Intel 400GB DC S3610 spec[1] says it does 25k, so I think you should be more specific about how you tested it.
- His SSD is rated at 15k random write IOPS[2], so it is already exceeding that by a fair margin (both iostat and fio report around 22k), though the datasheet doesn't list a sequential rating.
- His command says bs=128k, but the output says 4k, so either that command didn't really produce that result, or something is bugged. (Is this where the confusion lies?)
- Also note he didn't set --ioengine=..., so depending on how the default changes per version, others could be comparing psync or another engine against his ioengine=sync; the engine should be stated explicitly when comparing results. A fully explicit version of the command is sketched below.
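
For comparison, here is a minimal sketch of how I would pin the test down so results are comparable across fio versions and machines. It assumes the same device (/dev/sdl), that the 4k block size shown in the output is what was actually intended, and an arbitrary job name; like the original run, it writes directly to the raw device, so it is destructive:

fio --filename=/dev/sdl --ioengine=sync --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test-4k

Dropping --direct=1 and --sync=1 from the same command line would give the buffered baseline that Christian suggests further down in the quote; if that run gets much closer to the drive's rated speed, the slow part is the DSYNC path.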
[1] http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3610-series.html
[2] https://www.micron.com/~/media/documents/products/data-sheet/ssd/s600dc_series_2_5_sas_ssd.pdf

>
> Your numbers here prove basically that you can't expect much more than
> 100MB/s (50MB/s effective) from these SSDs at the best of circumstances.
>
> How do these numbers change when you do "normal" FS level writes w/o
> sync/direct in fio, so the only SYNCs coming from the FS for its journals?
> Or just direct to the device w/o FS, an even more "friendly" test.
> Large/long runs so you can observe things when the pagecache gets flushed.
>
> If this gets significantly closer to the limits of your SSD, or in your case
> that of your 6Gb/s SAS2 link (so 600MB/s), then the proof is complete that
> the culprit is bad DSYNC handling of those SSDs.
>
> [...]
>
> |
> | Device: rrqm/s wrqm/s    r/s      w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl       0.00   0.00   0.00 22182.60   0.00 88730.40     8.00     0.78  0.04    0.00    0.04  0.04 77.66
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.21  0.04    2.51    3.40   0.00 89.83
> |
> | Device: rrqm/s wrqm/s    r/s      w/s  rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl       0.00   0.00   0.00 22392.00   0.00 89568.00     8.00     0.79  0.04    0.00    0.04  0.04 79.26
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.94  0.04    3.35    3.40   0.00 88.26
> [...]

--
--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
Internet: http://www.brockmann-consult.de
--------------------------------------------

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com