Hello,

On Wed, 7 Dec 2016 12:39:11 +0100 Christian Theune wrote:

> Hi,
>
> I'm now working with the raw device and getting interesting results.
>
> For one, I went through all reviews about the Micron DC S610 again and, as always, the devil is in the detail. I noticed that the test results are quite favorable, but I didn't previously notice the caveat (which applies to SSDs in general) that preconditioning may be in order.
>
> See http://www.storagereview.com/seagate_12002_micron_s600dc_enterprise_sas_ssd_review
>
> The Micron in their tests shows quite extreme initial max latency until preconditioning settles.
>
> I can relate to that, as the SSDs that I put into the cluster last Friday (5 days ago) have quite different characteristics in my statistics compared to the ones I added this Monday evening (2 days ago).
>
Be that as it may, it's still lacking, see below.

> I took one of the early ones and evacuated the OSD to perform tests. Sebastian's fio call for testing journal ability ended up like this at the current time:
>
> | cartman06 ~ # fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=128k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
> | journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> | fio-2.0.14
> | Starting 1 process
> | Jobs: 1 (f=1): [W] [100.0% done] [0K/88852K/0K /s] [0/22.3K/0 iops] [eta 00m:00s]
> | journal-test: (groupid=0, jobs=1): err= 0: pid=28606: Wed Dec 7 11:59:36 2016
> |   write: io=5186.7MB, bw=88517KB/s, iops=22129, runt= 60001msec
> |     clat (usec): min=37, max=1519, avg=43.77, stdev=10.89
> |      lat (usec): min=37, max=1519, avg=43.94, stdev=10.90
> |     clat percentiles (usec):
> |      |  1.00th=[   39],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
> |      | 30.00th=[   41], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
> |      | 70.00th=[   43], 80.00th=[   44], 90.00th=[   47], 95.00th=[   53],
> |      | 99.00th=[   71], 99.50th=[   87], 99.90th=[  157], 99.95th=[  201],
> |      | 99.99th=[  478]
> |     bw (KB/s)  : min=81096, max=91312, per=100.00%, avg=88519.19, stdev=1762.43
> |     lat (usec) : 50=92.42%, 100=7.28%, 250=0.27%, 500=0.02%, 750=0.01%
> |     lat (usec) : 1000=0.01%
> |     lat (msec) : 2=0.01%
> |   cpu          : usr=5.43%, sys=14.64%, ctx=1327888, majf=0, minf=6
> |   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> |      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      issued    : total=r=0/w=1327777/d=0, short=r=0/w=0/d=0
> |
> | Run status group 0 (all jobs):
> |   WRITE: io=5186.7MB, aggrb=88516KB/s, minb=88516KB/s, maxb=88516KB/s, mint=60001msec, maxt=60001msec
> |
> | Disk stats (read/write):
> |   sdl: ios=15/1326283, merge=0/0, ticks=1/47203, in_queue=46970, util=78.29%
>
> That doesn't look too bad to me, specifically the 99.99th of 478 microseconds seems fine.
>
> The iostat during this run looks OK as well:
>
Both do look pretty bad to me. Your SSD with a nominal write speed of 850MB/s is doing 88MB/s at 80% utilization.
The puny 400GB DC S3610 in my example earlier can do 400MB/s per Intel specs and was at 70% with 300MB/s (so half of that being journal writes!).

My experience with Intel SSDs (as mentioned before on this ML) is that their stated speeds can actually be achieved within about a 10% margin when used with Ceph, be it for pure journaling or as OSDs with inline journals.
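A back-of-the-envelope check, just connecting the numbers you already posted: at iodepth=1 with sync writes, every 4KB write has to wait for the previous one to be acknowledged, so the throughput ceiling is simply the block size divided by the average commit latency:

    1 / 43.77 usec  ~= 22.8K synced writes/s
    22.8K * 4KB     ~= 90MB/s

which is essentially the 88MB/s you measured. The nice-looking percentiles and the low bandwidth are the same thing seen from two angles: roughly 44 usec per synced 4KB write is all this drive gives you at queue depth 1.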
Your numbers here basically prove that you can't expect much more than 100MB/s (50MB/s effective) from these SSDs under the best of circumstances.

How do these numbers change when you do "normal" FS-level writes without sync/direct in fio, so that the only SYNCs are the ones coming from the FS for its journal? Or writes straight to the device without a FS, an even more "friendly" test. Use large/long runs so you can observe things when the pagecache gets flushed (see the example invocations sketched further below).

If this gets significantly closer to the limits of your SSD, or in your case that of your 6Gb/s SAS2 link (so 600MB/s), then the proof is complete that the culprit is bad DSYNC handling on those SSDs.

> | cartman06 ~ # iostat -x 5 sdl
> | Linux 4.4.27-gentoo (cartman06)   12/07/2016   _x86_64_   (24 CPU)
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           5.70  0.05    3.09    5.09   0.00 86.07
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 2.92 31.24 148.16 1851.99 8428.66 114.61 1.61 9.00 0.68 10.75 0.22 4.03
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           3.31  0.04    1.97    1.48   0.00 93.19
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.02  0.03    2.38    1.44   0.00 92.13
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 12.40 3101.40 92.80 12405.60 8.03 0.11 0.04 0.10 0.04 0.04 11.12
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.02  0.05    3.57    4.78   0.00 87.58
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22166.20 0.00 88664.80 8.00 0.80 0.04 0.00 0.04 0.04 79.58
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           3.64  0.05    2.77    4.98   0.00 88.56
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22304.20 0.00 89216.80 8.00 0.78 0.04 0.00 0.04 0.04 78.08
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.89  0.05    2.97   11.15   0.00 80.93
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22022.00 0.00 88088.00 8.00 0.79 0.04 0.00 0.04 0.04 78.68
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           3.45  0.04    2.74    4.24   0.00 89.53
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22182.60 0.00 88730.40 8.00 0.78 0.04 0.00 0.04 0.04 77.66
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.21  0.04    2.51    3.40   0.00 89.83
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22392.00 0.00 89568.00 8.00 0.79 0.04 0.00 0.04 0.04 79.26
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.94  0.04    3.35    3.40   0.00 88.26
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22078.40 0.00 88313.60 8.00 0.79 0.04 0.00 0.04 0.04 78.70
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.43  0.04    3.02    4.68   0.00 87.83
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22141.60 0.00 88566.40 8.00 0.77 0.04 0.00 0.04 0.03 77.24
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.16  0.04    2.82    4.66   0.00 88.32
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22177.00 0.00 88708.00 8.00 0.78 0.04 0.00 0.04 0.04 78.24
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.09  0.03    3.02   12.34   0.00 80.52
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22156.60 0.00 88626.40 8.00 0.78 0.04 0.00 0.04 0.04 78.36
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           5.43  0.04    3.38    4.07   0.00 87.08
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22298.80 0.00 89195.20 8.00 0.77 0.03 0.00 0.03 0.03 77.36
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           7.33  0.05    4.42    4.58   0.00 83.62
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 21905.20 0.00 87620.80 8.00 0.79 0.04 0.00 0.04 0.04 79.20
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.91  0.03    3.52    3.39   0.00 88.15
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 12.40 18629.40 92.80 74517.60 8.00 0.67 0.04 0.10 0.04 0.04 67.18
>
> I'm now running fio --filename=/dev/sdl --rw=write --bs=128k --numjobs=1 --iodepth=32 --group_reporting --name=journal-test to condition the device fully. After that I'll perform some more tests based on mixed loads.
>
> On 7 Dec 2016, at 12:20, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > I wasn't talking about abandoning Ceph up there, just that for this unit (storage and VMs) a freeze might be the better, safer option.
> > The way it's operated makes that a possibility; others will of course want/need to upgrade their clusters and keep them running as indefinitely as possible.
>
> I read your comment from a while ago about us all requiring some level of "insanity" to run mission-critical OSS storage. ;)
> We all tend to trust the OSS kernel and things like file systems (sometimes wrongly so), but the complexity of Ceph coupled with its comparatively small user base of course makes it a bigger risk.
>
> > Yup, something I did for our stuff (and not just Ceph SSDs) as well, there's a nice Nagios plugin for this.
>
> I'll see if I can get this into our collectd somehow to feed into our graphing.
>
That's also more or less straightforward, but it requires somebody to stare at things.
So it's a nice addition, but no replacement for proactive automatic monitoring and alerting.

Christian

> > Could be housekeeping, could be pagecache flushes or other XFS ops.
> > Probably best to test/compare with a standalone SSD.
>
> Hmm. I'll see when I introduce additional abstraction layers: raw device, LVM, files on XFS. After that maybe also a mixture of two concurrently running fios, one for files on XFS (OSD) and one for writing to the journal LVM.
>
> > If it were hooked up to the backplane (expander or individual connectors per drive?) with just one link/lane (6Gb/s) that would indeed be a noticeable bottleneck.
> > But I have a hard time imagining that.
> >
> > If it were with just one mini-SAS connector aka 4 lanes to an expander port, it would halve your potential bandwidth but still be more than what you're currently likely to produce there.
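For reference on the link math: a single 6Gb/s SAS2 lane nets roughly 600MB/s of payload after 8b/10b encoding, and a 4-lane mini-SAS uplink to an expander about 2.4GB/s shared across the drives behind it. Either number is far above the 90-100MB/s seen here, so the link is unlikely to be the limit until the drive itself gets much closer to its rated speed.

As for the comparison runs suggested further up, something along these lines would do. This is an untested sketch; the job names, the test directory and the block size/size/runtime values are placeholders to adjust to your environment:

  # raw device, buffered writes, no O_DIRECT/O_SYNC (the "friendliest" case)
  fio --filename=/dev/sdl --rw=write --bs=4m --numjobs=1 --iodepth=1 \
      --runtime=600 --time_based --group_reporting --name=raw-buffered-test

  # buffered writes to a file on XFS, so the only syncs come from the FS journal
  fio --directory=/srv/fio-test --size=50g --rw=write --bs=4m --numjobs=1 \
      --iodepth=1 --runtime=600 --time_based --group_reporting --name=fs-buffered-test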
> It's a far fetch, but then again it's a really big list of (some very small) things that can go wrong and screw everything up at the VM level …
>
> Cheers,
> Christian

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com