Hi,

I’m now working with the raw device and getting interesting results. For one, I went through all reviews about the Micron DC S610 again and, as always, the devil is in the detail. I noticed that the test results are quite favorable, but I hadn’t previously noticed the caveat (which applies to SSDs in general) that preconditioning may be in order. In their tests the Micron shows quite extreme initial max latency until preconditioning settles. I can relate to that, as the SSDs I put into the cluster last Friday (5 days ago) show quite different characteristics in my statistics than the ones I added this Monday evening (2 days ago). I took one of the early ones and evacuated the OSD to perform tests.

Sebastian’s fio call for testing journal ability currently looks like this:

cartman06 ~ # fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.0.14
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/88852K/0K /s] [0 /22.3K/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=28606: Wed Dec 7 11:59:36 2016
  write: io=5186.7MB, bw=88517KB/s, iops=22129, runt= 60001msec
    clat (usec): min=37, max=1519, avg=43.77, stdev=10.89
     lat (usec): min=37, max=1519, avg=43.94, stdev=10.90
    clat percentiles (usec):
     |  1.00th=[   39],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
     | 30.00th=[   41], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
     | 70.00th=[   43], 80.00th=[   44], 90.00th=[   47], 95.00th=[   53],
     | 99.00th=[   71], 99.50th=[   87], 99.90th=[  157], 99.95th=[  201],
     | 99.99th=[  478]
    bw (KB/s)  : min=81096, max=91312, per=100.00%, avg=88519.19, stdev=1762.43
    lat (usec) : 50=92.42%, 100=7.28%, 250=0.27%, 500=0.02%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%
  cpu          : usr=5.43%, sys=14.64%, ctx=1327888, majf=0, minf=6
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1327777/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=5186.7MB, aggrb=88516KB/s, minb=88516KB/s, maxb=88516KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  sdl: ios=15/1326283, merge=0/0, ticks=1/47203, in_queue=46970, util=78.29%

That doesn’t look too bad to me; specifically, the 99.99th percentile of 478 microseconds seems fine.
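As a side note: to actually watch the preconditioning settle over time rather than only in the end-of-run summary, I could run the same call for longer and let fio write completion latency logs. This is just a sketch I haven’t run yet; the longer runtime and the log prefix are arbitrary choices of mine:

# same journal test, but for 10 minutes, writing latency log files
# with the given prefix so the max-latency trend can be graphed afterwards
fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=600 --time_based --group_reporting \
    --write_lat_log=journal-test-lat --name=journal-test

If I remember correctly, fio_generate_plots (shipped with fio) can turn those log files into graphs.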
The iostat during this run looks OK as well:

cartman06 ~ # iostat -x 5 sdl
Linux 4.4.27-gentoo (cartman06)   12/07/2016   _x86_64_   (24 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           5.70   0.05     3.09     5.09    0.00  86.07

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    2.92  31.24    148.16  1851.99   8428.66    114.61      1.61   9.00     0.68    10.75   0.22   4.03

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.31   0.04     1.97     1.48    0.00  93.19

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00      0.00     0.00      0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.02   0.03     2.38     1.44    0.00  92.13

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00  12.40   3101.40    92.80  12405.60      8.03      0.11   0.04     0.10     0.04   0.04  11.12

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.02   0.05     3.57     4.78    0.00  87.58

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22166.20     0.00  88664.80      8.00      0.80   0.04     0.00     0.04   0.04  79.58

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.64   0.05     2.77     4.98    0.00  88.56

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22304.20     0.00  89216.80      8.00      0.78   0.04     0.00     0.04   0.04  78.08

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.89   0.05     2.97    11.15    0.00  80.93

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22022.00     0.00  88088.00      8.00      0.79   0.04     0.00     0.04   0.04  78.68

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.45   0.04     2.74     4.24    0.00  89.53

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22182.60     0.00  88730.40      8.00      0.78   0.04     0.00     0.04   0.04  77.66

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.21   0.04     2.51     3.40    0.00  89.83

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22392.00     0.00  89568.00      8.00      0.79   0.04     0.00     0.04   0.04  79.26

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.94   0.04     3.35     3.40    0.00  88.26

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22078.40     0.00  88313.60      8.00      0.79   0.04     0.00     0.04   0.04  78.70

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.43   0.04     3.02     4.68    0.00  87.83

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22141.60     0.00  88566.40      8.00      0.77   0.04     0.00     0.04   0.03  77.24

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.16   0.04     2.82     4.66    0.00  88.32

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22177.00     0.00  88708.00      8.00      0.78   0.04     0.00     0.04   0.04  78.24

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.09   0.03     3.02    12.34    0.00  80.52

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22156.60     0.00  88626.40      8.00      0.78   0.04     0.00     0.04   0.04  78.36

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           5.43   0.04     3.38     4.07    0.00  87.08

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  22298.80     0.00  89195.20      8.00      0.77   0.03     0.00     0.03   0.03  77.36

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           7.33   0.05     4.42     4.58    0.00  83.62

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00   0.00  21905.20     0.00  87620.80      8.00      0.79   0.04     0.00     0.04   0.04  79.20

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           4.91   0.03     3.52     3.39    0.00  88.15

Device:  rrqm/s  wrqm/s    r/s       w/s    rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00  12.40  18629.40    92.80  74517.60      8.00      0.67   0.04     0.10     0.04   0.04  67.18

I’m now running

fio --filename=/dev/sdl --rw=write --bs=128k --numjobs=1 --iodepth=32 --group_reporting --name=journal-test

to condition the device fully. After that I’ll perform some more tests based on mixed loads.
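For the mixed loads I’ll probably start with something along these lines; the 70/30 read/write split, queue depth and runtime are just a first guess on my part, nothing settled yet:

# 70/30 random read/write mix at 4k against the raw device;
# libaio so that the iodepth actually results in queued I/O
fio --filename=/dev/sdl --direct=1 --ioengine=libaio --rw=randrw --rwmixread=70 \
    --bs=4k --numjobs=1 --iodepth=16 --runtime=300 --time_based \
    --group_reporting --name=mixed-test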
Cheers,
Christian

--
Christian Theune · ct@xxxxxxxxxxxxxxx · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick